- We have seen different data types in the first session.
- One of them was `factor`, representing categorical data:
  - A person is male or female.
  - A plane is passenger, cargo or military.
  - Some good is produced in Spain, France, China or UK.
2019-04-03
- A dummy variable is either `TRUE` or `FALSE` (or `0` or `1`).
- It marks membership in a category: for a member of the category, the dummy is `TRUE`.
- Whether `0` corresponds to `TRUE` or `FALSE` is up to you. Just be consistent!
- Take a dummy `is.male`. Let's define its counterpart, `is.female`:
df1 = data.frame(income = c(3000, 5000, 7500, 3500), sex = c("male", "female", "male", "female"))
df1$is.male = df1$sex == "male"
df1$is.female = df1$sex == "female"
lm(income ~ is.male + is.female,df1)
df1$linear_comb = df1$is.male + df1$is.female
df1
##   income    sex is.male is.female linear_comb
## 1   3000   male    TRUE     FALSE           1
## 2   5000 female   FALSE      TRUE           1
## 3   7500   male    TRUE     FALSE           1
## 4   3500 female   FALSE      TRUE           1
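The `linear_comb` column is 1 in every row, which is exactly the intercept column of the regression. A minimal sketch of how one might check the consequence in base `R`; it rebuilds `df1` so that it runs on its own, and the names `X` and `fit` are only illustrative:

```r
# Rebuild the example data so this block is self-contained
df1 <- data.frame(income = c(3000, 5000, 7500, 3500),
                  sex = c("male", "female", "male", "female"))
df1$is.male <- df1$sex == "male"
df1$is.female <- df1$sex == "female"

# Design matrix of the regression with both dummies:
# columns are (Intercept), is.maleTRUE, is.femaleTRUE
X <- model.matrix(income ~ is.male + is.female, data = df1)

ncol(X)      # 3 columns ...
qr(X)$rank   # ... but only 2 of them are linearly independent

# lm() cannot separate the two dummies and reports NA for one coefficient
fit <- lm(income ~ is.male + is.female, data = df1)
coef(fit)
```

A rank below the number of columns is the numerical signature of perfect collinearity.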
- `is.male + is.female` is always equal to 1!
- In other words, `is.male = 1 - is.female`. A perfect collinearity!
- One dummy is enough: if someone is `male`, they are not `female`.
lm1 = lm(income ~ is.female, df1)
lm1
##
## Call:
## lm(formula = income ~ is.female, data = df1)
##
## Coefficients:
##   (Intercept)  is.femaleTRUE
##          5250          -1000
- We dropped `is.male` from the regression.
- So where is `male` now?
- Males have `is.female = 0`. So `male` is subsumed in the intercept!
- At `is.female = 0`, the prediction is \(\widehat{y} = b_0 + b_1 \cdot 0 = 5250\).
- The coefficient on `is.female` is \(b_1 = -1000\). It measures the difference in intercept from being female.
library(ScPoEconometrics)
launchApp("reg_dummy")
library(ScPoEconometrics)
launchApp("reg_dummy_example")
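The reading of the intercept and the dummy coefficient can be verified against simple group means. A small sketch in base `R`, rebuilding `df1` so the block is self-contained; `group_means` and `b` are illustrative names:

```r
# Rebuild the example data
df1 <- data.frame(income = c(3000, 5000, 7500, 3500),
                  sex = c("male", "female", "male", "female"))
df1$is.female <- df1$sex == "female"

# Mean income by group
group_means <- tapply(df1$income, df1$sex, mean)
group_means

# Regression on the single dummy
b <- coef(lm(income ~ is.female, data = df1))

# The intercept equals the male mean ...
unname(b["(Intercept)"])                              # 5250
# ... and the dummy coefficient equals the female-minus-male gap
unname(b["is.femaleTRUE"])                            # -1000
unname(group_means["female"] - group_means["male"])   # -1000
```

With a single dummy and no other regressors, the regression simply reproduces the two group means.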
- The `R` data type `factor` can represent more than just `0` and `1` in terms of categories.
- The function `factor` takes a numeric vector `x` and a vector of `labels`. Each value of `x` is associated with a label:
factor(x = c(1, 1, 2, 4, 3, 4), labels = c("HS", "someCol", "BA", "MSc"))
## [1] HS      HS      someCol MSc     BA      MSc
## Levels: HS someCol BA MSc
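To see how the values of `x` map to labels, it can help to inspect the pieces of the resulting object. A short sketch using base `R` only; the variable name `edu` is an assumption made for illustration:

```r
# Same call as in the example, stored under an illustrative name
edu <- factor(x = c(1, 1, 2, 4, 3, 4),
              labels = c("HS", "someCol", "BA", "MSc"))

levels(edu)       # the labels, matched to the sorted unique values of x
as.integer(edu)   # the underlying integer codes: 1 1 2 4 3 4
nlevels(edu)      # number of categories: 4
table(edu)        # number of observations per category
```

The labels are assigned in the order of the sorted unique values of `x`, so `1` becomes `HS` and `4` becomes `MSc`.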
- Using a `factor` in an `lm` formula automatically picks an omitted/reference category!
data(Wages, package = "Ecdat")  # load data
str(Wages)
## 'data.frame':    4165 obs. of  12 variables:
##  $ exp    : int  3 4 5 6 7 8 9 30 31 32 ...
##  $ wks    : int  32 43 40 39 42 35 32 34 27 33 ...
##  $ bluecol: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 2 2 ...
##  $ ind    : int  0 0 0 0 1 1 1 0 0 1 ...
##  $ south  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 1 1 1 ...
##  $ smsa   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ married: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ sex    : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ union  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ ed     : int  9 9 9 9 9 9 9 11 11 11 ...
##  $ black  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ lwage  : num  5.56 5.72 6 6 6.06 ...
lm_w = lm(lwage ~ exp, data = Wages)  # setup fit
summary(lm_w)                         # show output
##
## Call:
## lm(formula = lwage ~ exp, data = Wages)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.30153 -0.29144  0.02307  0.27927  1.97171
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5014318  0.0144657  449.44   <2e-16 ***
## exp         0.0088101  0.0006378   13.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4513 on 4163 degrees of freedom
## Multiple R-squared:  0.04383,    Adjusted R-squared:  0.0436
## F-statistic: 190.8 on 1 and 4163 DF,  p-value: < 2.2e-16
- We `update` our `lm` object as follows:
lm_sex = update(lm_w, . ~ . + sex)  # update lm_w with same LHS, same RHS, but add sex to it
summary(lm_sex)                     # show output
##
## Call:
## lm(formula = lwage ~ exp + sex, data = Wages)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.87081 -0.26688  0.01733  0.26336  1.90325
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1257661  0.0223319  274.31   <2e-16 ***
## exp         0.0076134  0.0006082   12.52   <2e-16 ***
## sexmale     0.4501101  0.0210974   21.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4286 on 4162 degrees of freedom
## Multiple R-squared:  0.1381, Adjusted R-squared:  0.1377
## F-statistic: 333.4 on 2 and 4162 DF,  p-value: < 2.2e-16
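The `. ~ .` notation in `update` may be easier to grasp on a tiny example. The sketch below uses the built-in `mtcars` data purely as a stand-in (it is not part of the wage example), and the names `fit0`, `fit1` and `fit2` are illustrative:

```r
# Base fit on a built-in data set, used here only as a stand-in
fit0 <- lm(mpg ~ wt, data = mtcars)

# ". ~ ." means: keep the old left-hand side and the old right-hand side
fit1 <- update(fit0, . ~ . + hp)   # add a regressor
fit2 <- update(fit1, . ~ . - wt)   # drop a regressor

formula(fit0)   # mpg ~ wt
formula(fit1)   # mpg ~ wt + hp
formula(fit2)   # mpg ~ hp
```

Each call refits the model on the same data, so only the part of the formula that changes has to be typed.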
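- Notice the coefficient name `sexmale`! `R` appends the offset level to the variable name.
- There is no `sexfemale`.
- `female` is the reference category.
- Both groups share the same coefficient on `exp`.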
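By default `lm` uses the first level of a factor as the reference category, and `relevel` can change that. A hedged sketch on made-up numbers: the data frame `toy` and the column `sex2` are hypothetical and only mimic the structure of the wage data.

```r
# Hypothetical toy data, not the Wages data set
toy <- data.frame(
  lwage = c(6.0, 6.2, 6.5, 6.9, 6.1, 6.8),
  sex   = factor(c("female", "female", "male", "male", "female", "male"))
)

levels(toy$sex)                        # "female" "male": first level is the reference
b_f <- coef(lm(lwage ~ sex, data = toy))
names(b_f)                             # "(Intercept)" "sexmale"

# Make "male" the reference category instead
toy$sex2 <- relevel(toy$sex, ref = "male")
levels(toy$sex2)                       # "male" "female"
b_m <- coef(lm(lwage ~ sex2, data = toy))
names(b_m)                             # "(Intercept)" "sex2female"
```

Changing the reference category flips the sign of the dummy coefficient and moves the intercept to the other group's mean; the fitted values stay the same.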