- We have seen different data types in the first session.
- One of them was `factor`, representing categorical data:
  - A person is male or female.
  - A plane is passenger, cargo or military.
  - Some good is produced in Spain, France, China or UK.
2019-04-03
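- A dummy variable is either `TRUE` or `FALSE` (or `0` or `1`).
- We use dummies to mark category membership: if an observation is a member, the dummy is `TRUE`.
- Whether `0` corresponds to `TRUE` or `FALSE` is up to you. Just be consistent!
- Suppose we create a dummy `is.male`. Let's define its counterpart, `is.female`.

df1 = data.frame(income = c(3000, 5000, 7500, 3500),
                 sex = c("male", "female", "male", "female"))
df1$is.male = df1$sex == "male"
df1$is.female = df1$sex == "female"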
lm(income ~ is.male + is.female,df1)
df1$linear_comb = df1$is.male + df1$is.female
df1
##   income    sex is.male is.female linear_comb
## 1   3000   male    TRUE     FALSE           1
## 2   5000 female   FALSE      TRUE           1
## 3   7500   male    TRUE     FALSE           1
## 4   3500 female   FALSE      TRUE           1
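A minimal sketch of what `lm` does when both dummies enter the regression. It rebuilds `df1` so that it runs on its own; the names `fit_both` and `b` are illustrative and not part of the original slides. Because the two dummies always sum to one, they duplicate the intercept column, so one coefficient cannot be identified and `lm` reports it as `NA`.

```r
# Rebuild the toy data from above so this block is self-contained
df1 <- data.frame(income = c(3000, 5000, 7500, 3500),
                  sex = c("male", "female", "male", "female"))
df1$is.male   <- df1$sex == "male"
df1$is.female <- df1$sex == "female"

# Fit with BOTH dummies: intercept, is.male and is.female are linearly dependent
fit_both <- lm(income ~ is.male + is.female, data = df1)
b <- coef(fit_both)
print(b)

# Exactly one coefficient is not identified and is reported as NA
print(names(b)[is.na(b)])

# The design matrix has 3 columns but only rank 2
print(fit_both$rank)
```

Dropping either one of the two dummies removes the redundancy.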
- `is.male + is.female` is always equal to 1!
- In other words, `is.male = 1 - is.female`. A perfect collinearity!
- Whenever a person is `male`, they are not `female`.
- So we keep only one of the two dummies in the regression:

lm1 = lm(income ~ is.female, df1)
lm1
##
## Call:
## lm(formula = income ~ is.female, data = df1)
##
## Coefficients:
##   (Intercept)  is.femaleTRUE
##          5250          -1000
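- Notice that we dropped `is.male` from the regression.
- Where is `male` now?
- A male is someone with `is.female = 0`. So `male` is subsumed in the intercept!
- At `is.female = 0`, i.e. for males, \(\widehat{y} = b_0 + b_1 \cdot 0 = 5250\).
- The coefficient on `is.female` is \(b_1 = -1000\). It measures the difference in intercept from being female.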
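The interpretation above can be checked directly against group means. This is a small sketch that rebuilds the toy data; `mean_male` and `mean_female` are names introduced here for illustration. With a single dummy regressor, the intercept should equal the mean income of the reference group (males) and the slope should equal the difference in mean income between females and males.

```r
# Rebuild the toy data and the single-dummy regression
df1 <- data.frame(income = c(3000, 5000, 7500, 3500),
                  sex = c("male", "female", "male", "female"))
df1$is.female <- df1$sex == "female"
lm1 <- lm(income ~ is.female, data = df1)

# Group means computed by hand
mean_male   <- mean(df1$income[!df1$is.female])
mean_female <- mean(df1$income[df1$is.female])

# Intercept b_0 is the mean of the reference group (males)
print(all.equal(unname(coef(lm1)[1]), mean_male))

# Slope b_1 is the female-minus-male difference in means
print(all.equal(unname(coef(lm1)[2]), mean_female - mean_male))
```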


library(ScPoEconometrics)
launchApp("reg_dummy")
library(ScPoEconometrics)
launchApp("reg_dummy_example")
- The R data type `factor` can represent more than just `0` and `1` in terms of categories.
- `factor` takes a numeric vector `x` and a vector of `labels`. Each value of `x` is associated with a label:

factor(x = c(1,1,2,4,3,4), labels = c("HS","someCol","BA","MSc"))
## [1] HS      HS      someCol MSc     BA      MSc
## Levels: HS someCol BA MSc
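To make the link between values and labels concrete, here is a short sketch that inspects the same factor. The variable name `edu` is an assumption made for this example. A factor stores integer codes together with a vector of labels (its levels), and both can be retrieved separately.

```r
# Same factor as above, stored under an illustrative name
edu <- factor(x = c(1, 1, 2, 4, 3, 4),
              labels = c("HS", "someCol", "BA", "MSc"))

print(levels(edu))        # the labels, in the order of the sorted values of x
print(as.integer(edu))    # the underlying integer codes
print(as.character(edu))  # the label attached to each observation
print(table(edu))         # number of observations per category
```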
- Using a `factor` in an `lm` object automatically chooses an omitted/reference category!

data(Wages, package = "Ecdat")  # load data
str(Wages)
## 'data.frame':    4165 obs. of  12 variables:
##  $ exp    : int  3 4 5 6 7 8 9 30 31 32 ...
##  $ wks    : int  32 43 40 39 42 35 32 34 27 33 ...
##  $ bluecol: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 2 2 ...
##  $ ind    : int  0 0 0 0 1 1 1 0 0 1 ...
##  $ south  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 1 1 1 ...
##  $ smsa   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ married: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ sex    : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ union  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ ed     : int  9 9 9 9 9 9 9 11 11 11 ...
##  $ black  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ lwage  : num  5.56 5.72 6 6 6.06 ...
lm_w = lm(lwage ~ exp, data = Wages)  # setup fit
summary(lm_w)                         # show output
##
## Call:
## lm(formula = lwage ~ exp, data = Wages)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.30153 -0.29144  0.02307  0.27927  1.97171
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5014318  0.0144657  449.44   <2e-16 ***
## exp         0.0088101  0.0006378   13.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4513 on 4163 degrees of freedom
## Multiple R-squared:  0.04383,    Adjusted R-squared:  0.0436
## F-statistic: 190.8 on 1 and 4163 DF,  p-value: < 2.2e-16

- To add `sex` as a regressor, we `update` our `lm` object as follows:

lm_sex = update(lm_w, . ~ . + sex)  # update lm_w with same LHS, same RHS, but add sex to it
summary(lm_sex)                     # show output
##
## Call:
## lm(formula = lwage ~ exp + sex, data = Wages)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.87081 -0.26688  0.01733  0.26336  1.90325
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1257661  0.0223319  274.31   <2e-16 ***
## exp         0.0076134  0.0006082   12.52   <2e-16 ***
## sexmale     0.4501101  0.0210974   21.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4286 on 4162 degrees of freedom
## Multiple R-squared:  0.1381, Adjusted R-squared:  0.1377
## F-statistic: 333.4 on 2 and 4162 DF,  p-value: < 2.2e-16
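- The new coefficient is called `sexmale`! R appends the offset level to the variable name.
- There is no `sexfemale`: `female` is the reference category.
- `sexmale` shifts the intercept for males; both groups share the same slope on `exp`.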
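If a different reference category is wanted, base R's `relevel` changes which level comes first. The sketch below uses a small hypothetical data frame `d` (made-up numbers, not the `Wages` data) and illustrative names `sex_m`, `fit_f` and `fit_m`. It shows that the reference level is the first level of the factor, and that switching it flips the sign of the dummy coefficient while moving the intercept to the other group's mean.

```r
# Hypothetical toy data for illustration only (NOT the Wages data)
d <- data.frame(
  wage = c(10, 12, 11, 15, 17, 16),
  sex  = factor(c("female", "female", "female", "male", "male", "male"))
)

# The first level is the omitted/reference category
print(levels(d$sex))

# Default fit: female is the reference, so the dummy is named "sexmale"
fit_f <- lm(wage ~ sex, data = d)
print(coef(fit_f))

# Make male the reference category instead
d$sex_m <- relevel(d$sex, ref = "male")
fit_m <- lm(wage ~ sex_m, data = d)
print(coef(fit_m))
```

Both fits describe the same model; only the labelling of the baseline group changes.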