Lesson 5: Model Specification and Data Issues

Weifeng Jin

Department of Econometrics, Statistics and Applied Economics

Universitat de Barcelona

Business Econometrics

Academic Year: 2024-2025

Jin (Bachelor in ADE) Lesson 5 Business Econometrics 1 / 29

Introduction

This topic raises several issues in multiple regression, from the perspectives of data and specification, that arise when the technique is used in empirical applications.

- Data: three types of special data points

- Multicollinearity: linear correlation among regressors

- Specification error: functional form misspecification

- Proxy variables

- Measurement errors

Definition of Special Data

- Outlier: a data point whose dependent variable $y$ does not follow the general trend of the rest of the data. It has a large residual: $|\hat{u}_i| = |y_i - \hat{y}_i|$ is large compared with the other residuals.
- Leverage: a data point has high leverage if it has "extreme" values of one or more independent variables $x$.
- Influential point: a data point that unduly influences any part of a regression: estimation, prediction, or hypothesis tests. Outliers and high-leverage data points have the potential to be influential.

Example: outlier, leverage, influential point

Scatterplot Examples

[Figure: four scatterplots of y vs x. (a) Baseline. (b) Outlier, with fitted lines y = 1.73 + 5.12x and y = 2.96 + 5.04x. (c) High leverage, with fitted lines y = 1.73 + 5.12x and y = 2.47 + 4.93x. (d) Influential point, with fitted lines y = 1.73 + 5.12x and y = 8.51 + 3.32x.]

Indicator for leverage (Cont'd)

Leverage Indicator in Regression

Recall the matrix form of the regression: $Y = X\beta + U$.

- OLS estimate: $\hat{\beta} = (X'X)^{-1}X'Y$.
- Fitted values (predicted values of $Y$): $\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y = HY$.
- The "hat" matrix: $H = X(X'X)^{-1}X'$.

From this representation, any fitted value can be written as

$$\hat{y}_i = h_{i1}y_1 + h_{i2}y_2 + \dots + h_{ii}y_i + \dots + h_{in}y_n, \quad i = 1, 2, \dots, n. \tag{1}$$

- The leverage $h_{ii}$ quantifies the influence of $y_i$ on its own predicted value $\hat{y}_i$.
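The hat-matrix representation can be checked numerically. The following is a minimal sketch, assuming a small made-up data set (the values of `x` and `y` are hypothetical, not taken from the slides); it builds $H$ and confirms that $HY$ reproduces the OLS fitted values.

```python
import numpy as np

# Hypothetical data: six observations, an intercept and one regressor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])
y = np.array([6.8, 12.1, 17.0, 22.3, 27.2, 63.0])
X = np.column_stack([np.ones_like(x), x])  # first column is the constant X1 = 1

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)   # (X'X)^{-1} X'Y
H = X @ np.linalg.solve(XtX, X.T)          # X (X'X)^{-1} X'

y_hat = X @ beta_hat        # fitted values from the coefficients
y_hat_from_H = H @ y        # the same fitted values as weighted sums of y

leverage = np.diag(H)       # h_ii, the weight of y_i in its own fitted value

print("beta_hat:", np.round(beta_hat, 3))
print("h_ii:", np.round(leverage, 3))
print("same fitted values:", np.allclose(y_hat, y_hat_from_H))
```

The observation with the most extreme `x` value (the last one here) receives the largest diagonal element of `H`.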

Indicator for leverage (Cont'd)

Properties of Leverages

Some properties of the leverages $h_{ii}$:

- $h_{ii}$ measures the distance between the $x$ value of the $i$th data point and the mean of the $x$ values for all $n$ points.
- $h_{ii}$ lies between 0 and 1, inclusive.
- $\sum_{i=1}^{n} h_{ii} = k$, the number of coefficients in the regression (including the intercept).

Criteria:

- Extreme $x$ value: $h_{ii} > 3k/n$.
- Large leverage point: $h_{ii} > 2k/n$ ("rule of thumb").

[Figure: leverage. Dotplot of the x values only (sample mean = 5.227), annotated with $h_{1,1} = 0.153$, $h_{11,11} = 0.048$, $h_{21,21} = 0.358$.]
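As a rough illustration of these properties, the sketch below computes the leverages for a hypothetical regressor with one extreme value and applies the two cut-offs, read here as $3k/n$ and $2k/n$ (three and two times the average leverage; this reading of the thresholds is an assumption). The data are invented for the example.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.diag(H)

# Hypothetical regressor: nine ordinary values and one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
X = np.column_stack([np.ones_like(x), x])
n, k = X.shape

h = leverages(X)
sum_h = h.sum()            # should equal k, the number of coefficients
large = h > 2 * k / n      # rule-of-thumb flag
extreme = h > 3 * k / n    # stricter flag

print("h_ii:", np.round(h, 3))
print("sum of h_ii:", round(float(sum_h), 6), "k =", k)
print("flagged as large:", np.where(large)[0])
print("flagged as extreme:", np.where(extreme)[0])
```

Only the last observation is flagged: its `x` value lies far from the sample mean, which is what the leverage measures.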

Identifying outlier (Cont'd)

Identifying Outliers

- Residual: $\hat{u}_i = y_i - \hat{y}_i$ for $i = 1, 2, \dots, n$.
- Standardized residual: $r_i = \dfrac{\hat{u}_i}{se(\hat{u}_i)} = \dfrac{\hat{u}_i}{\sqrt{MSE\,(1 - h_{ii})}}$, where $MSE$ is the mean squared error of the $\hat{u}_i$. Outlier: $|r_i| > 3$; sometimes a cut-off value of 2 is adopted.
- Studentized residual: $d_i = \dfrac{\hat{u}_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}$, where $MSE_{(i)}$ is the mean squared error of the residuals from the regression with the $i$th observation deleted. Outlier: $|d_i| > 3$.
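The two residual measures can be compared on a small example. This is a sketch under stated assumptions: the data are hypothetical with one planted outlier, $MSE$ is taken as $SSR/(n-k)$, and $MSE_{(i)}$ is obtained by literally re-running the regression without observation $i$.

```python
import numpy as np

def ols(X, y):
    """Return coefficients, residuals and MSE = SSR / (n - k)."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    mse = resid @ resid / (X.shape[0] - X.shape[1])
    return beta, resid, mse

# Hypothetical data on a line, with one planted outlier at index 4.
x = np.arange(1.0, 11.0)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1])
y = 2.0 + 5.0 * x + noise
y[4] += 15.0
X = np.column_stack([np.ones_like(x), x])
n, k = X.shape

_, u_hat, mse = ols(X, y)
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

# Standardized residuals use the full-sample MSE.
r = u_hat / np.sqrt(mse * (1 - h))

# Studentized residuals use the MSE from the regression without observation i.
d = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    _, _, mse_i = ols(X[keep], y[keep])
    d[i] = u_hat[i] / np.sqrt(mse_i * (1 - h[i]))

print("standardized:", np.round(r, 2))
print("studentized: ", np.round(d, 2))
```

In a sample this small the outlier inflates the full-sample $MSE$, so its standardized residual stays below 3 while its studentized residual is far above it; this is one reason the deleted-observation version is often preferred.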

Identifying Influential Data Points (Cont'd)

Identifying Influential Data Points

- Difference in Fits (DFFITS):

$$DFFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}}$$

  where $\hat{y}_{(i)}$ is the prediction for observation $i$ from the regression with the $i$th observation deleted. Influential data point if $|DFFITS_i| > 2\sqrt{\dfrac{k+1}{n-k-1}}$.

- Cook's distance:

$$D_i = \frac{(y_i - \hat{y}_i)^2}{k \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

  It summarizes how much all of the fitted values change when the $i$th observation is deleted. A data point with a larger $D_i$ has a stronger influence on the fitted values.
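Both influence measures can be computed by deleting one observation at a time. The sketch below assumes the usual scaling of DFFITS by $\sqrt{MSE_{(i)} h_{ii}}$ and uses hypothetical data in which the last point has both an extreme `x` and an off-trend `y`. It also checks that the closed-form Cook's distance agrees with the "change in all fitted values" definition.

```python
import numpy as np

def fit(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical data: the last point is far out in x and well below the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 14.0])
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, 0.0])
y = 2.0 + 5.0 * x + noise
y[-1] = 40.0
X = np.column_stack([np.ones_like(x), x])
n, k = X.shape

beta = fit(X, y)
y_hat = X @ beta
u_hat = y - y_hat
mse = u_hat @ u_hat / (n - k)
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

dffits = np.empty(n)
cook_by_deletion = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = fit(X[keep], y[keep])
    resid_i = y[keep] - X[keep] @ beta_i
    mse_i = resid_i @ resid_i / (n - 1 - k)
    y_hat_i = X @ beta_i  # fitted values for all points, observation i deleted
    dffits[i] = (y_hat[i] - y_hat_i[i]) / np.sqrt(mse_i * h[i])
    cook_by_deletion[i] = np.sum((y_hat - y_hat_i) ** 2) / (k * mse)

# Closed form: no refitting needed.
cook = u_hat**2 / (k * mse) * h / (1 - h) ** 2
cutoff = 2 * np.sqrt((k + 1) / (n - k - 1))

print("DFFITS:", np.round(dffits, 2), "cutoff:", round(float(cutoff), 3))
print("Cook's D:", np.round(cook, 3))
```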

Least Absolute Deviations Estimation

Least Absolute Deviations (LAD) Estimation

One remedial method against outliers: least absolute deviations (LAD).

The LAD estimators of $\beta_j$ solve

$$\min_{b_1, b_2, \dots, b_k} \sum_{i=1}^{n} \left| y_i - b_1 - b_2 x_{i2} - \dots - b_k x_{ik} \right|$$

- OLS gives increasing importance to larger residuals; LAD does not.
- LAD is less sensitive to outlying observations.
- LAD estimates the conditional median: $Med(Y \mid X) = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_k X_k$.

Drawbacks:

- More computationally intensive than OLS.
- No exact inference for LAD under the CLM assumptions.
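To see the difference in sensitivity, the sketch below compares OLS with an approximate LAD fit on hypothetical data containing one outlier. The LAD fit here uses iteratively reweighted least squares, which is one simple numerical approximation chosen for this illustration and not necessarily the method used in the course; the function name `lad_irls` and its parameters are assumptions.

```python
import numpy as np

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

def lad_irls(X, y, n_iter=200, eps=1e-6):
    """Approximate LAD by iteratively reweighted least squares."""
    beta = ols(X, y)  # start from the OLS fit
    for _ in range(n_iter):
        resid = y - X @ beta
        w = 1.0 / np.maximum(np.abs(resid), eps)    # weights 1/|residual|
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)  # weighted least squares step
    return beta

# Hypothetical data: true line 2 + 5x, with the last observation pushed up by 40.
x = np.arange(1.0, 11.0)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1])
y = 2.0 + 5.0 * x + noise
y[-1] += 40.0
X = np.column_stack([np.ones_like(x), x])

beta_ols = ols(X, y)
beta_lad = lad_irls(X, y)

def sum_abs_dev(beta):
    return float(np.sum(np.abs(y - X @ beta)))

print("OLS:", np.round(beta_ols, 3), "sum |u|:", round(sum_abs_dev(beta_ols), 2))
print("LAD:", np.round(beta_lad, 3), "sum |u|:", round(sum_abs_dev(beta_lad), 2))
```

The OLS slope is pulled toward the outlier, while the LAD slope stays close to the slope of the bulk of the data. The loop also illustrates the computational drawback: LAD has no closed-form solution and needs an iterative method.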

Special Data: R&D Intensity

R&D Intensity and Firm Size

Using the following regression model, we want to test whether R&D intensity increases with firm size, which corresponds to $\beta_2 > 1$:

$$\log(rd) = \beta_1 + \beta_2 \log(sales) + \beta_3\, profmarg + u. \tag{2}$$

Alternatively, we can test $\beta_2 > 0$ in the specification

$$rdintens = \beta_1 + \beta_2\, sales + \beta_3\, profmarg + u, \tag{3}$$

where $rdintens$ is R&D expenditures as a percentage of sales, $sales$ is sales in millions, and $profmarg$ is profits as a percentage of sales.
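The reason the first specification is tested against 1 and the second against 0 can be made explicit. The short derivation below is a sketch: it subtracts $\log(sales)$ from both sides of (2) and writes the usual one-sided t statistics, where $se(\hat{\beta}_2)$ denotes the standard error of the estimate.

```latex
\begin{align*}
% Subtract log(sales) from both sides of (2):
\log\!\left(\frac{rd}{sales}\right)
  &= \beta_1 + (\beta_2 - 1)\log(sales) + \beta_3\, profmarg + u \\
% so rd/sales rises with sales, holding profmarg fixed, exactly when beta_2 > 1.
\text{Model (2):}\quad
  H_0\colon \beta_2 = 1 \ \text{ vs. } \ H_1\colon \beta_2 > 1,
  &\qquad t = \frac{\hat{\beta}_2 - 1}{se(\hat{\beta}_2)} \\
\text{Model (3):}\quad
  H_0\colon \beta_2 = 0 \ \text{ vs. } \ H_1\colon \beta_2 > 0,
  &\qquad t = \frac{\hat{\beta}_2}{se(\hat{\beta}_2)}
\end{align*}
```

In (3) the dependent variable is already the ratio of R&D to sales, so any positive coefficient on $sales$ means intensity rises with firm size.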

Perfect Multicollinearity

Multicollinearity

Perfect Multicollinearity in Regression

Recall one assumption of the classical linear regression model: no perfect multicollinearity among the regressors included in the model (expressed by $rank(X'X) = k$).

- What is perfect multicollinearity?
- What are the practical consequences?
- What remedial measures can be taken to alleviate the problem?

Multicollinearity

Definition of Multicollinearity

Consider a general linear regression model

$$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_k X_k + u. \tag{4}$$

- Perfect multicollinearity: for a $k$-variable regression with explanatory variables $X_1, X_2, \dots, X_k$, where $X_1 = 1$, there exists an exact linear relationship among them if

$$\lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_k X_k = 0, \tag{5}$$

  where $\lambda_1, \lambda_2, \dots, \lambda_k$ are constants that are not all zero simultaneously.

- Imperfect multicollinearity: the $k$ variables are intercorrelated, but not perfectly:

$$\lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_k X_k + v = 0, \tag{6}$$

  where $\lambda_1, \lambda_2, \dots, \lambda_k$ are constants that are not all zero simultaneously, and $v$ is a stochastic component.

Multicollinearity: example

Multicollinearity Examples

For example:

X2    X3    X3*
10    50    52
15    75    75
18    90    97
24    120   129
30    150   152

Figure: collinearity

- What is the relationship between $X_2$ and $X_3$? What about $X_2$ and $X_3^*$?
- Another example, election votes: $voteA = \beta_1 + \beta_2\, expendA + \beta_3\, expendB + \beta_4\, totexpend + u$.
- What about $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_2^2 + \beta_4 X_2^3 + u$? Is there a linear relationship among $X_2$, $X_2^2$, $X_2^3$?
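The small table can be examined directly. The sketch below uses the fifteen values printed in the example; the label of the third column is not legible in the source, so the name `x3_star` is an assumption. It checks the rank of the design matrix with each candidate third regressor.

```python
import numpy as np

# Values from the example table; the third column's name is assumed.
x2 = np.array([10.0, 15.0, 18.0, 24.0, 30.0])
x3 = np.array([50.0, 75.0, 90.0, 120.0, 150.0])
x3_star = np.array([52.0, 75.0, 97.0, 129.0, 152.0])
const = np.ones_like(x2)

X_exact = np.column_stack([const, x2, x3])      # x3 is an exact multiple of x2
X_near = np.column_stack([const, x2, x3_star])  # close to, but not exactly, a multiple

rank_exact = np.linalg.matrix_rank(X_exact)
rank_near = np.linalg.matrix_rank(X_near)
corr_near = np.corrcoef(x2, x3_star)[0, 1]

print("x3 / x2:", x3 / x2)
print("rank with x3:", rank_exact, " rank with x3_star:", rank_near)
print("corr(x2, x3_star):", round(float(corr_near), 4))
```

With the second column the design matrix loses a rank (perfect multicollinearity); with the third it keeps full rank even though the two regressors are very highly correlated (imperfect multicollinearity).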

Multicollinearity: illustration

Collinearity Illustration

[Figure: collinearity. Five panels showing Y, X2 and X3: (a) No collinearity, (b) Low collinearity, (c) Moderate collinearity, (d) High collinearity, (e) Very high collinearity.]

Multicollinearity: sources

Sources of Multicollinearity

The data collection method employed.
Constraints on the model or in the population being sampled.
Model specification.
An overdetermined model.
Common trend shared by the regressors (for time series data).

Perfect Multicollinearity

Consequences and Remedies for Perfect Multicollinearity

- Consequence: the regression coefficients remain indeterminate and their standard errors are infinite. The coefficients cannot be estimated by OLS. For example, take $Y = \beta_2 X_2 + \beta_3 X_3 + u$, given $X_3 = 2X_2$:
  - Is $Y = (\beta_2 + \beta_3) X_2 + (2\beta_3) X_3 + u$ a correct specification?
  - What about $Y = (\beta_2 - 2\beta_3) X_2 + (2\beta_3) X_3 + u$?
- Remedy: drop any one of the regressors involved in the exact linear relationship.
- Special case with dummies: the dummy variable trap.
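The indeterminacy can be illustrated numerically. In the sketch below the regressor values and the coefficients `b2`, `b3` are hypothetical. With $X_3 = 2X_2$, shifting weight between the two coefficients by any amount `c` leaves the fitted values unchanged, so the data cannot pin down a unique pair.

```python
import numpy as np

# Hypothetical regressor values with the exact relationship X3 = 2 * X2.
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x3 = 2.0 * x2
X = np.column_stack([x2, x3])  # no intercept, as in the example model

b2, b3 = 1.5, 0.7              # hypothetical coefficient values
c = b3                         # any constant works; this choice gives (b2 - 2*b3, 2*b3)

fit_a = X @ np.array([b2, b3])
fit_b = X @ np.array([b2 - 2 * c, b3 + c])
fit_c = X @ np.array([b2 + 2 * b3, 0.0])  # all the weight placed on X2

rank_XtX = np.linalg.matrix_rank(X.T @ X)

print("identical fits:", np.allclose(fit_a, fit_b), np.allclose(fit_a, fit_c))
print("rank of X'X:", rank_XtX, "for", X.shape[1], "coefficients")
```

Because $X'X$ has rank 1 instead of 2, it cannot be inverted, which is why OLS fails; dropping one of the two regressors restores a unique solution for the combined effect $\beta_2 + 2\beta_3$.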

Imperfect Multicollinearity: Consequences

Consequences of Imperfect Multicollinearity

In the case of near or high multicollinearity,

Large variances and covariances of OLS estimators.
Wide confidence intervals.
"Insignificant" t ratio.
A high $R^2$ but few significant t ratios.
Sensitivity of OLS estimators and their standard errors to small changes in data.

The Components of the OLS Variances

Components of OLS Variances

Under the classical assumptions of the multiple linear regression (the normality assumption of the error term is not needed), we have

$$Var(\hat{\beta}_j) = \frac{\sigma^2}{TSS_j\,(1 - R_j^2)}, \quad j = 2, \dots, k \text{ (the slope coefficients)}, \tag{7}$$

where $TSS_j = \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2$ is the total sample variation in $X_j$, and $R_j^2$ is the R-squared from regressing $X_j$ on all other independent variables.

- The error variance $\sigma^2$: a larger $\sigma^2$ means larger variances for the OLS estimators.
- The total sample variation in $X_j$, $TSS_j$: the larger the total variation in $X_j$, the smaller $Var(\hat{\beta}_j)$.
- The linear relationships among the $X$, $R_j^2$: the larger $R_j^2$, the higher the degree of multicollinearity among the $X$ and the larger $Var(\hat{\beta}_j)$.
  - $R_j^2 = 0$: the smallest $Var(\hat{\beta}_j)$ is obtained.
  - $R_j^2 = 1$: perfect multicollinearity among the $X$; thus $Var(\hat{\beta}_j) \to \infty$.
  - Variance inflation factor (VIF): $VIF_j = \dfrac{1}{1 - R_j^2}$, which increases with the extent of collinearity.
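The variance inflation factor can be computed from the auxiliary regressions described above. The following is a minimal sketch with hypothetical regressors (`x3` is built to be nearly twice `x2`, while `x4` is only loosely related to them); the helper names `r_squared` and `vif` are illustrative.

```python
import numpy as np

def r_squared(Z, y):
    """R-squared from regressing y on Z (Z already contains a constant)."""
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - resid @ resid / tss

def vif(regressors):
    """VIF_j = 1 / (1 - R_j^2) for each column of an (n, m) array without a constant."""
    n, m = regressors.shape
    out = np.empty(m)
    for j in range(m):
        others = np.delete(regressors, j, axis=1)
        Z = np.column_stack([np.ones(n), others])
        out[j] = 1.0 / (1.0 - r_squared(Z, regressors[:, j]))
    return out

# Hypothetical regressors.
x2 = np.arange(1.0, 13.0)
x3 = 2.0 * x2 + np.array([0.5, -0.5, 0.3, -0.3, 0.4, -0.4,
                          0.2, -0.2, 0.6, -0.6, 0.1, -0.1])
x4 = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0, 8.0])

vifs = vif(np.column_stack([x2, x3, x4]))
print("VIF for x2, x3, x4:", np.round(vifs, 2))
```

The two nearly collinear regressors receive very large VIFs, while `x4` stays close to 1, which matches the point that collinearity among some regressors need not inflate the variance of every coefficient.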

General comments on multicollinearity

General Comments on Multicollinearity

- High (near) multicollinearity is NOT a violation of any assumption of the classical linear regression model.
- There is no absolute number we can cite, or test we can rely on, to conclude that there is a multicollinearity problem; it is just a feature of the sample. Is $R_j^2 = 0.9$ too high? It only means a strong linear correlation among the $X$, and does not invalidate OLS estimation.
- The same problem (as a high degree of collinearity) can arise from a small sample size: "micronumerosity".
- A high degree of correlation between certain independent variables can be irrelevant if we are only interested in other parameters. In $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + u$, the amount of correlation between $X_3$ and $X_4$ has no direct effect on $Var(\hat{\beta}_2)$.
- Although the problem of multicollinearity is not clearly defined, it is better to have less correlation among the $X$. Increase the sample size!

Functional Form Misspecification

Understanding Functional Form Misspecification

Functional form misspecification: a multiple regression model does not properly account for the relationship between the dependent variable and the observed explanatory variables.

- $\log(wage) = \beta_1 + \beta_2\, educ + \beta_3\, exper + \beta_4\, exper^2 + u$. If $exper^2$ is omitted:
  - Biased estimation of the return to education.
  - Misleading interpretation of the return to experience.
- $\log(wage) = \beta_1 + \beta_2\, educ + \beta_3\, exper + \beta_4\, exper^2 + \beta_5\, female + \beta_6\, female \cdot educ + u$
- $wage = \beta_1 + \beta_2\, educ + \beta_3\, exper + \beta_4\, exper^2 + u$

For some misspecified functional forms, the F test can be used, e.g. for quadratic terms. In other cases, a more general functional form misspecification test is needed $\Rightarrow$ RESET.
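The slides name RESET without giving details, so the following is only a sketch of the common textbook version, stated as an assumption: add the squared and cubed fitted values to the original regression and compute the F statistic for their joint significance. The data are hypothetical, with a deterministic sine pattern standing in for the error term, and the function name `reset_f` is illustrative.

```python
import numpy as np

def ssr_and_fit(X, y):
    """Sum of squared residuals and fitted values from regressing y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ beta
    resid = y - y_hat
    return resid @ resid, y_hat

def reset_f(X, y):
    """F statistic for adding y_hat^2 and y_hat^3 to the regression of y on X."""
    n, k = X.shape
    ssr_r, y_hat = ssr_and_fit(X, y)                  # restricted model
    X_aug = np.column_stack([X, y_hat**2, y_hat**3])
    ssr_ur, _ = ssr_and_fit(X_aug, y)                 # unrestricted model
    q = 2                                             # number of added terms
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - q))

# Hypothetical data: the true relationship is quadratic in x.
n = 40
x = np.linspace(0.0, 4.0, n)
noise = 0.3 * np.sin(1.7 * np.arange(n))  # deterministic stand-in for an error term
y = 1.0 + 2.0 * x + 1.5 * x**2 + noise

X_linear = np.column_stack([np.ones(n), x])           # omits the quadratic term
X_quadratic = np.column_stack([np.ones(n), x, x**2])  # includes it

f_linear = reset_f(X_linear, y)
f_quadratic = reset_f(X_quadratic, y)
print("RESET F, linear model:   ", round(float(f_linear), 2))
print("RESET F, quadratic model:", round(float(f_quadratic), 2))
```

Under the null of correct specification the statistic would be compared with an $F_{2,\,n-k-2}$ critical value. A large value for the linear model signals that some nonlinearity has been missed, although RESET does not say which functional form to use instead.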
