Introduction to Model Specification and Data Issues
This lesson covers several issues that arise when multiple regression is applied in
empirical work, from the perspectives of data and model specification.
- Data: three types of special data points
- Multicollinearity: linear correlation among regressors
- Specification error: functional form misspecification
- Proxy variables
- Measurement errors
Jin (Bachelor in ADE)
Lesson 5
Business Econometrics
Definition of Special Data
- Outlier: a data point whose dependent variable y does not follow the general
trend of the rest of the data:
Large residual: $|\hat{u}_i| = |y_i - \hat{y}_i|$ is large compared to other residuals.
- Leverage: a data point has high leverage if it has "extreme" values of an
independent variable(s), x.
- Influential point: a data point which unduly influences any part of a
regression: estimation, prediction, or hypothesis tests.
Outliers and high leverage data points have the potential to be influential.
Example: outlier, leverage, influential point
Scatterplot Examples
Figure: four scatterplots of y vs x.
(a) Baseline
(b) Outlier: fitted lines y = 1.73 + 5.12x and y = 2.96 + 5.04x
(c) High leverage: fitted lines y = 1.73 + 5.12x and y = 2.47 + 4.93x
(d) Influential point: fitted lines y = 1.73 + 5.12x and y = 8.51 + 3.32x
Leverage Indicator in Regression
Recall the matrix form of regression:
$Y = X\beta + U$,
- OLS estimate: $\hat{\beta} = (X'X)^{-1}X'Y$.
- Fitted values (predicted values of Y): $\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y = HY$.
- The "hat" matrix: $H = X(X'X)^{-1}X'$.
From the previous representation, any fitted value can be written as:
$\hat{y}_i = h_{i1}y_1 + h_{i2}y_2 + \dots + h_{ii}y_i + \dots + h_{in}y_n$, for $i = 1, 2, \dots, n$ (1)
- The leverage $h_{ii}$ quantifies the influence of $y_i$ on its own predicted value $\hat{y}_i$.
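The hat-matrix representation can be checked numerically. The sketch below is only an illustration, assuming a regression with an intercept and a single regressor (so that $X'X$ is 2x2 and can be inverted by hand); the data and the function name `hat_matrix` are made up for this example.

```python
def hat_matrix(x):
    """Hat matrix H = X (X'X)^{-1} X' for X = [1, x] (intercept plus one regressor)."""
    n = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx                  # determinant of X'X = [[n, sx], [sx, sxx]]
    inv = [[sxx / det, -sx / det],           # (X'X)^{-1} from the 2x2 inverse formula
           [-sx / det, n / det]]
    rows = [(1.0, v) for v in x]             # rows of X
    return [[sum(ri[a] * inv[a][b] * rj[b] for a in range(2) for b in range(2))
             for rj in rows] for ri in rows]

# Hypothetical data: the last x value lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [3.1, 4.9, 7.2, 8.8, 21.0]

H = hat_matrix(x)
# Each fitted value is a weighted sum of all observed y values: y_hat_i = sum_j h_ij * y_j.
y_hat = [sum(h_ij * y_j for h_ij, y_j in zip(row, y)) for row in H]
# The leverages are the diagonal elements h_ii.
leverages = [H[i][i] for i in range(len(x))]
```

In this made-up sample the last observation has by far the largest diagonal element, which reflects its extreme x value rather than anything about its y value.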
Properties of Leverages
Some properties of the leverages $h_{ii}$:
- $h_{ii}$ measures the distance between the x value of the ith data point and the mean of
the x values for all n points.
- $h_{ii}$ lies between 0 and 1, inclusive.
- $\sum_{i=1}^{n} h_{ii} = k$, which is the number of coefficients included in the regression
(including the intercept).
Criteria (rules of thumb):
- Extreme x value: $h_{ii} > 3k/n$.
- Large leverage point: $h_{ii} > 2k/n$.
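A minimal sketch of these properties and cut-offs, assuming a regression with an intercept and one regressor (k = 2) and reading the criteria as multiples of the mean leverage k/n, i.e. 2k/n and 3k/n. In that special case the leverage has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, which makes the "distance from the mean" interpretation explicit. The data and function names are hypothetical.

```python
def leverages(x):
    """Leverages h_ii for a regression of y on an intercept and x (k = 2)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    # h_ii grows with the squared distance of x_i from the sample mean.
    return [1.0 / n + (v - x_bar) ** 2 / sxx for v in x]

def flag_leverage(h, k):
    """Indices exceeding the rule-of-thumb cut-offs 2k/n and 3k/n."""
    n = len(h)
    return {"large": [i for i, v in enumerate(h) if v > 2 * k / n],
            "extreme": [i for i, v in enumerate(h) if v > 3 * k / n]}

# Hypothetical x values; the last one is far from the rest.
x = [0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 14.0]
h = leverages(x)
flags = flag_leverage(h, k=2)
```

Note that the leverages depend only on the x values, so a point can be flagged here without being an outlier in y.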
Figure: leverage. A dotplot containing just the x values (sample mean = 5.227), with leverages h(1,1) = 0.153, h(11,11) = 0.048, and h(21,21) = 0.358.
Identifying Outliers
- residual: $\hat{u}_i = y_i - \hat{y}_i$ for $i = 1, 2, \dots, n$.
- standardized residuals:
$r_i = \frac{\hat{u}_i}{se(\hat{u}_i)} = \frac{\hat{u}_i}{\sqrt{MSE\,(1 - h_{ii})}}$,
where MSE is the mean squared error of the residuals $\hat{u}_i$.
  - outlier: $|r_i| > 3$; sometimes a cut-off value of 2 is adopted.
- studentized residual:
$d_i = \frac{\hat{u}_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}$,
where $MSE_{(i)}$ is the mean squared error of the residuals obtained from the
regression with the ith observation deleted.
  - outlier: $|d_i| > 3$.
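The two residual measures can be sketched as follows, assuming a regression with an intercept and one regressor (k = 2) and made-up data with one planted outlier. To avoid refitting n times, the sketch uses the standard identity $SSE_{(i)} = SSE - \hat{u}_i^2 / (1 - h_{ii})$ for the sum of squared residuals with observation i deleted; the function names are hypothetical.

```python
import math

def fit_line(x, y):
    """OLS of y on an intercept and x; returns (b1, b2, residuals, leverages)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((a - xb) ** 2 for a in x)
    b2 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sxx
    b1 = yb - b2 * xb
    resid = [b - b1 - b2 * a for a, b in zip(x, y)]
    lev = [1.0 / n + (a - xb) ** 2 / sxx for a in x]
    return b1, b2, resid, lev

def outlier_diagnostics(x, y, k=2):
    """Standardized residuals r_i and studentized (deleted) residuals d_i."""
    n = len(x)
    _, _, u, h = fit_line(x, y)
    sse = sum(e * e for e in u)
    mse = sse / (n - k)
    r = [e / math.sqrt(mse * (1 - hi)) for e, hi in zip(u, h)]
    d = []
    for e, hi in zip(u, h):
        # MSE from the regression without observation i, via the deletion identity.
        mse_i = (sse - e * e / (1 - hi)) / (n - k - 1)
        d.append(e / math.sqrt(mse_i * (1 - hi)))
    return r, d

# Hypothetical data: roughly y = 2 + 3x, with an outlier planted at index 5.
x = [float(i) for i in range(1, 11)]
y = [5.3, 7.8, 11.1, 13.6, 17.2, 35.0, 22.9, 26.3, 28.7, 32.1]
r, d = outlier_diagnostics(x, y)
```

In this example the studentized residual of the planted point is much larger in absolute value than its standardized residual, because the outlier inflates MSE but not $MSE_{(i)}$.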
Identifying Influential Data Points
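- Difference in Fits (DFFITS):
$DFFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}}$,
where $\hat{y}_{(i)}$ is the fitted value for observation i from the regression with the
ith observation deleted.
Influential data point if
$|DFFITS_i| > 2\sqrt{\frac{k+1}{n-k-1}}$.
- Cook's distance:
$D_i = \frac{(y_i - \hat{y}_i)^2}{k \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$.
  - It summarizes how much all of the fitted values change when the ith observation is
deleted.
  - A data point with a larger $D_i$ has a stronger influence on the fitted values.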
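A minimal sketch of both influence measures, assuming a regression with an intercept and one regressor (k = 2), the usual DFFITS scaling by $\sqrt{MSE_{(i)} h_{ii}}$, and made-up data in which the last point has an extreme x value and lies well off the trend. DFFITS is computed literally from its definition by refitting without each observation; the function names are hypothetical.

```python
import math

def fit_line(x, y):
    """OLS of y on an intercept and x; returns (b1, b2)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b2 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sum((a - xb) ** 2 for a in x)
    return yb - b2 * xb, b2

def influence_measures(x, y, k=2):
    """DFFITS, Cook's distance, and indices with |DFFITS| above the cut-off."""
    n = len(x)
    b1, b2 = fit_line(x, y)
    xb = sum(x) / n
    sxx = sum((a - xb) ** 2 for a in x)
    resid = [b - b1 - b2 * a for a, b in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - k)
    dffits, cooks = [], []
    for i in range(n):
        h = 1.0 / n + (x[i] - xb) ** 2 / sxx           # leverage of observation i
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        c1, c2 = fit_line(xs, ys)                      # refit without observation i
        mse_i = sum((b - c1 - c2 * a) ** 2 for a, b in zip(xs, ys)) / (n - 1 - k)
        y_hat_full = b1 + b2 * x[i]                    # fitted value, all data
        y_hat_del = c1 + c2 * x[i]                     # fitted value, obs i deleted
        dffits.append((y_hat_full - y_hat_del) / math.sqrt(mse_i * h))
        cooks.append(resid[i] ** 2 / (k * mse) * h / (1 - h) ** 2)
    cutoff = 2 * math.sqrt((k + 1) / (n - k - 1))
    flagged = [i for i, v in enumerate(dffits) if abs(v) > cutoff]
    return dffits, cooks, flagged

# Hypothetical data: roughly y = 2 + 3x, except the last point (x = 20, y = 30).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
y = [5.2, 7.9, 11.1, 13.8, 17.3, 19.9, 23.2, 25.8, 29.1, 30.0]
dffits, cooks, flagged = influence_measures(x, y)
```

Refitting n times is fine for a small illustration; for larger samples one would normally rely on the closed-form expressions in terms of $h_{ii}$.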
Least Absolute Deviations (LAD) Estimation
One remedy against outliers: least absolute deviations (LAD).
- The LAD estimators of $\beta_j$ solve
$\min_{b_1, b_2, \dots, b_k} \sum_{i=1}^{n} |y_i - b_1 - b_2 x_{i2} - \dots - b_k x_{ik}|$.
  - OLS gives increasing importance to larger residuals; LAD does not.
  - LAD is less sensitive to outlying observations.
  - LAD estimates the conditional median:
$\operatorname{Med}(Y|X) = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_k X_k$
- Drawbacks:
  - More computationally intensive than OLS.
  - No exact inference for LAD under the CLM assumptions.
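To see the difference in sensitivity, here is a small sketch for the one-regressor case. It relies on a known property of the LAD problem (it is a linear program, so some minimizer fits at least k data points exactly); with k = 2 one can therefore search all lines through pairs of points. This brute-force search is an assumption-laden illustration for tiny samples, not how statistical software solves LAD, and the data and function names are made up.

```python
from itertools import combinations

def lad_line(x, y):
    """LAD fit of y = b1 + b2*x by searching all lines through two data points."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue                                   # vertical line, not a candidate
        b2 = (y[j] - y[i]) / (x[j] - x[i])
        b1 = y[i] - b2 * x[i]
        loss = sum(abs(yi - b1 - b2 * xi) for xi, yi in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, b1, b2)
    return best[1], best[2]

def ols_line(x, y):
    """OLS fit of y = b1 + b2*x, for comparison."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b2 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sum((a - xb) ** 2 for a in x)
    return yb - b2 * xb, b2

# Hypothetical data: five points exactly on y = 1 + 2x plus one outlier at x = 5.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0, 41.0]
lad_b1, lad_b2 = lad_line(x, y)
ols_b1, ols_b2 = ols_line(x, y)
```

In this sample LAD recovers the line through the five regular points, while the OLS slope is pulled upward by the single outlier.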
Special Data: R&D Intensity
R&D Intensity and Firm Size
Using the following regression model, we want to test whether R&D intensity
increases with firm size, which corresponds to $\beta_2 > 1$.
$\log(rd) = \beta_1 + \beta_2 \log(sales) + \beta_3\, profmarg + u.$ (2)
Alternatively, we can use the following specification and test $\beta_2 > 0$:
$rdintens = \beta_1 + \beta_2\, sales + \beta_3\, profmarg + u,$ (3)
where rdintens is R&D expenditures as a percentage of sales, sales
measures the amount of sales in millions, and profmarg is profits as a
percentage of sales.
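The link between $\beta_2 > 1$ in the log specification and rising R&D intensity can be made explicit by subtracting log(sales) from both sides. The short derivation below is a sketch; the one-sided t statistic in the last line is the usual textbook form and is added here for illustration, not taken from the slides.

```latex
\begin{align*}
\log(rd) - \log(sales) &= \beta_1 + (\beta_2 - 1)\log(sales) + \beta_3\, profmarg + u \\
\log\!\left(\frac{rd}{sales}\right) &= \beta_1 + (\beta_2 - 1)\log(sales) + \beta_3\, profmarg + u \\
H_0:\ \beta_2 = 1 \quad &\text{against} \quad H_1:\ \beta_2 > 1, \qquad
t = \frac{\hat{\beta}_2 - 1}{\operatorname{se}(\hat{\beta}_2)}
\end{align*}
```

So R&D as a share of sales increases with sales, holding profmarg fixed, exactly when $\beta_2 - 1 > 0$.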
Multicollinearity
Perfect Multicollinearity in Regression
Recall one assumption of the classical linear regression model: no perfect
multicollinearity among the regressors included in the regression model [1].
- What is perfect multicollinearity?
- What are the practical consequences?
- What remedial measures can be taken to alleviate the problem?
[1] This assumption is expressed by $\operatorname{rank}(X'X) = k$.
Definition of Multicollinearity
Consider a general linear regression model
$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_k X_k + u$ (4)
- Perfect multicollinearity: for a k-variable regression with
explanatory variables $X_1, X_2, \dots, X_k$, where $X_1 = 1$, there exists an exact
linear relationship among them if the following condition is satisfied:
$\lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_k X_k = 0$ (5)
where $\lambda_1, \lambda_2, \dots, \lambda_k$ are constants that are not all zero simultaneously.
- Imperfect multicollinearity: the k variables are intercorrelated, but not
perfectly:
$\lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_k X_k + v = 0$ (6)
where $\lambda_1, \lambda_2, \dots, \lambda_k$ are constants that are not all zero simultaneously, and v is
a stochastic component.
Multicollinearity Examples
- For example:
X2    X3    X3*
10    50    52
15    75    75
18    90    97
24    120   129
30    150   152
Figure: collinearity
What is the relationship between X2 and X3? What about X2 and X3*?
- Another example, election votes:
$voteA = \beta_1 + \beta_2\, expendA + \beta_3\, expendB + \beta_4\, totexpend + u$
- What about:
$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_2^2 + \beta_4 X_2^3 + u \Rightarrow$
linear relationship among $X_2, X_2^2, X_2^3$?
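The first example can be checked directly from the numbers in the table. The sketch below uses the three columns as given; the name `x3_star` for the third column is chosen here for illustration, and the helper `correlation` is a plain sample correlation written out by hand.

```python
import math

x2 = [10, 15, 18, 24, 30]
x3 = [50, 75, 90, 120, 150]
x3_star = [52, 75, 97, 129, 152]      # third column of the table (name chosen here)

def correlation(a, b):
    """Sample correlation coefficient between two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

# Exact linear relationship: every X3 value equals 5 times the X2 value.
exact = all(v == 5 * u for u, v in zip(x2, x3))
r_exact = correlation(x2, x3)         # perfect collinearity
r_near = correlation(x2, x3_star)     # high but imperfect collinearity
```

The second column is an exact multiple of the first, so the two are perfectly collinear; the third column is very highly, but not perfectly, correlated with the first.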
Collinearity Illustration
Figure: collinearity. Five panels showing Y, X2, and X3: (a) No collinearity, (b) Low collinearity, (c) Moderate collinearity, (d) High collinearity, (e) Very high collinearity.
Sources of Multicollinearity
- The data collection method employed.
- Constraints on the model or in the population being sampled.
- Model specification.
- An overdetermined model.
- Common trend shared by the regressors (for time series data).
Consequences and Remedies for Perfect Multicollinearity
- Consequence:
  - The regression coefficients remain indeterminate and their standard errors are
infinite.
  - The coefficients cannot be estimated by OLS.
  - For example: $Y = \beta_2 X_2 + \beta_3 X_3 + u$, given $X_3 = 2X_2$:
    * Is $Y = (\beta_2 + \beta_3) X_2 + (2\beta_3) X_3 + u$ a correct specification?
    * What about $Y = (\beta_2 - 2\beta_3) X_2 + (2\beta_3) X_3 + u$?
- Remedy: drop one of the regressors involved in the exact linear
relationship.
- Special case with dummies: the dummy variable trap.
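Why the coefficients are indeterminate can be seen with a few lines of arithmetic. The sketch below assumes the setting of the example, a model without intercept and $X_3 = 2X_2$, with made-up x values: any coefficient pair with the same value of $b_2 + 2b_3$ produces identical fitted values, and the determinant of $X'X$ is zero, so $(X'X)^{-1}$ does not exist.

```python
# Hypothetical regressor values with an exact linear relationship X3 = 2 * X2.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x3 = [2.0 * v for v in x2]

def fitted(b2, b3):
    """Fitted values b2*X2 + b3*X3 for a given coefficient pair."""
    return [b2 * a + b3 * b for a, b in zip(x2, x3)]

def xtx_determinant():
    """Determinant of X'X for X = [X2, X3]."""
    s22 = sum(a * a for a in x2)
    s33 = sum(b * b for b in x3)
    s23 = sum(a * b for a, b in zip(x2, x3))
    return s22 * s33 - s23 * s23

# Three different coefficient pairs, all with b2 + 2*b3 = 7.
fit_a = fitted(1.0, 3.0)
fit_b = fitted(5.0, 1.0)
fit_c = fitted(7.0, 0.0)      # same fit after dropping X3 altogether
det = xtx_determinant()
```

Since the data cannot distinguish between these pairs, only the combination $\beta_2 + 2\beta_3$ is identified, which is why dropping one of the two regressors is the natural remedy.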
Consequences of Imperfect Multicollinearity
In the case of near or high multicollinearity,
- Large variances and covariances of OLS estimators.
- Wide confidence intervals.
- "Insignificant" t ratios.
- A high $R^2$ but few significant t ratios.
- Sensitivity of OLS estimators and their standard errors to small changes in
data.
Components of OLS Variances
Under the classical assumptions of the multiple linear regression model [2], we have
$\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{TSS_j (1 - R_j^2)}$, for the slope coefficients $j = 2, \dots, k$, (7)
where $TSS_j = \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2$ is the total sample variation in $X_j$, and $R_j^2$ is
the R-squared from regressing $X_j$ on all other independent variables.
- The error variance $\sigma^2$: a larger $\sigma^2$ means larger variances for the OLS estimators.
- The total sample variation in $X_j$, $TSS_j$: the larger the total variation in $X_j$
is, the smaller is $\operatorname{Var}(\hat{\beta}_j)$.
- The linear relationships among the regressors, $R_j^2$: the larger $R_j^2$ is, the higher the degree of
multicollinearity among the regressors and the larger $\operatorname{Var}(\hat{\beta}_j)$ is.
  - $R_j^2 = 0$: the smallest $\operatorname{Var}(\hat{\beta}_j)$ is obtained.
  - $R_j^2 = 1$: perfect multicollinearity among the regressors. Thus, $\operatorname{Var}(\hat{\beta}_j) \to \infty$.
  - Variance inflation factor (VIF): $VIF_j = \frac{1}{1 - R_j^2}$, which measures
the extent of collinearity.
[2] The normality assumption of the error term is not needed.
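The three components of the variance formula can be illustrated with a small calculation. The sketch below assumes a model with an intercept and exactly two regressors, so that the R-squared from regressing one regressor on the other is simply their squared correlation; the error variance `sigma2` is treated as known, and the data and function names are invented for the example.

```python
def r_squared_on_other(xj, xo):
    """R^2 from regressing xj on an intercept and one other regressor xo."""
    n = len(xj)
    mj, mo = sum(xj) / n, sum(xo) / n
    sjo = sum((a - mj) * (b - mo) for a, b in zip(xj, xo))
    sjj = sum((a - mj) ** 2 for a in xj)
    soo = sum((b - mo) ** 2 for b in xo)
    return sjo * sjo / (sjj * soo)          # squared sample correlation

def slope_variance(sigma2, xj, xo):
    """Var(beta_j hat) = sigma^2 / (TSS_j * (1 - R_j^2)), together with VIF_j."""
    n = len(xj)
    mj = sum(xj) / n
    tss_j = sum((a - mj) ** 2 for a in xj)  # total sample variation in X_j
    r2_j = r_squared_on_other(xj, xo)
    vif_j = 1.0 / (1.0 - r2_j)
    return sigma2 / tss_j * vif_j, vif_j

# Hypothetical data: the same X_j paired with a weakly and a strongly related regressor.
xj = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
xo_low = [2.0, 5.0, 1.0, 6.0, 3.0, 4.0]
xo_high = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
sigma2 = 1.0                                # assumed error variance, illustration only

var_low, vif_low = slope_variance(sigma2, xj, xo_low)
var_high, vif_high = slope_variance(sigma2, xj, xo_high)
```

With $\sigma^2$ and $TSS_j$ held fixed, the variance differs between the two cases only through the VIF, which is why the VIF is read as the factor by which collinearity inflates the variance.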
General Comments on Multicollinearity
- High/near multicollinearity is NOT a violation of any assumption of the classical
linear regression model.
- There is no absolute number that we can cite, or test that we can rely on, to
conclude that there is a multicollinearity problem. It is just a sample
feature.
  - Is $R_j^2 = 0.9$ too high? It only means a strong linear correlation among the regressors; it
does not invalidate OLS estimation.
- The same problem (as a high degree of collinearity) can arise from a small
sample size, "micronumerosity".
- A high degree of correlation between certain independent variables can be
irrelevant if we are only interested in other parameters.
$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + u$
  - The amount of correlation between $X_3$ and $X_4$ has no direct effect on $\operatorname{Var}(\hat{\beta}_2)$.
- Although the problem of multicollinearity is not clearly defined, it is
better to have less correlation among the regressors.
  - Increase the sample size!
Functional Form Misspecification
Understanding Functional Form Misspecification
Functional form misspecification: a multiple regression model does not properly
account for the relationship between the dependent variable and the observed explanatory variables.
- $\log(wage) = \beta_1 + \beta_2\, educ + \beta_3\, exper + \beta_4\, exper^2 + u$
  Omitting $exper^2$ leads to:
  - biased estimation of the return to education.
  - misleading interpretation of the return to experience.
- $\log(wage) = \beta_1 + \beta_2\, educ + \beta_3\, exper + \beta_4\, exper^2 + \beta_5\, female + \beta_6\, female \cdot educ + u$
- $wage = \beta_1 + \beta_2\, educ + \beta_3\, exper + \beta_4\, exper^2 + u$
Some functional form misspecifications can be detected with the F test, e.g. for quadratic terms.
In other cases, a more general functional form misspecification test is needed $\Rightarrow$
RESET