## Wednesday, July 11, 2012

### Out-of-sample performance of OLS vs. Nonparametric Regression

The true usefulness of a regression estimator should be judged by its out-of-sample performance. It is easy to achieve a perfect fit for a finite sample of N points by fitting a polynomial of degree N-1 to the data. However, such an estimator will suffer from high variance, that is, poor out-of-sample forecasting accuracy.
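To make the overfitting point concrete, here is a minimal sketch using made-up data (a noisy linear trend, not the ccard data). The helper `lagrange_predict` is a hypothetical name; it evaluates the unique degree N-1 polynomial passing through N points. The in-sample error is zero, while the error on new points is typically much larger.

```python
import random

def lagrange_predict(xs, ys, x0):
    """Evaluate at x0 the unique degree-(N-1) polynomial through the N points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x0 - xj) / (xi - xj)
        total += yi * basis
    return total

random.seed(0)
# Hypothetical training data: linear trend plus Gaussian noise.
xs = [float(i) for i in range(8)]
ys = [2.0 + 0.5 * x + random.gauss(0.0, 0.3) for x in xs]

# In-sample: the interpolating polynomial reproduces every training point.
in_sample_sse = sum((lagrange_predict(xs, ys, x) - y) ** 2
                    for x, y in zip(xs, ys))

# Out-of-sample: forecast two points just beyond the training range and
# compare against the noise-free trend.
new_xs = [8.0, 9.0]
truth = [2.0 + 0.5 * x for x in new_xs]
out_sample_sse = sum((lagrange_predict(xs, ys, x) - t) ** 2
                     for x, t in zip(new_xs, truth))

print("in-sample SSE:    ", in_sample_sse)
print("out-of-sample SSE:", out_sample_sse)
```

The perfect in-sample fit comes from chasing the noise, which is exactly what makes the forecasts unreliable.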

Let's build on the example from the previous post: http://statsmodels-np.blogspot.com/2012/07/nonparametric-regression.html

To summarize, we are using the ccard dataset in statsmodels, which has data on the average credit card expenditures of 72 individuals. We are interested in studying the relationship between income and credit card expenditure. We saw that, while a linear relationship is inadequate, a quadratic relationship provides a good fit for the data.

Let's use the first 57 observations to train our estimates. To get the kernel regression estimate in statsmodels:

model = nparam.Reg(tydat=[np.log(avgexp[0:57])], txdat=[income[0:57]], var_type='c', bw='cv_lc')

We obtain the least-squares coefficients for the model Log(AvgExp) = a + b1*income + b2*income^2 + e to be:
a = 3.11, b1 = 0.72, and b2 = -0.04
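For readers who want to reproduce the parametric side, the quadratic model can be fitted by ordinary least squares roughly as follows. This is a sketch, not the code used for the numbers above: the data are synthetic stand-ins generated from an assumed quadratic, and the variable names are illustrative. Only the 57/15 split mirrors the post.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in data (NOT the ccard dataset): 72 draws from a quadratic.
income = rng.uniform(1.5, 10.0, size=72)
log_avgexp = 3.0 + 0.7 * income - 0.04 * income**2 + rng.normal(0.0, 0.5, size=72)

# Training split: the first 57 observations, as in the post.
x_train, y_train = income[:57], log_avgexp[:57]

# Design matrix with columns [1, income, income^2].
X_train = np.column_stack([np.ones_like(x_train), x_train, x_train**2])
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
a, b1, b2 = coef

# Forecast the 15 held-out observations from the fitted coefficients.
x_test = income[57:72]
ols_fcast = a + b1 * x_test + b2 * x_test**2

print("a, b1, b2 =", a, b1, b2)
```

With the real data, `income` and `log_avgexp` would be replaced by the income column and the log of average expenditure.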

We then use the estimates from both models to forecast the remaining 15 observations. The accuracy of the forecasts is judged by the sum of squared forecast errors (SSFE) and the mean squared forecast error (MSFE).
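Both measures are simple to compute. The sketch below uses hypothetical helper names (`ssfe`, `msfe`) and toy numbers chosen only for illustration: SSFE sums the squared differences between actual and forecast values, and MSFE divides that sum by the number of forecasts.

```python
def ssfe(actual, forecast):
    """Sum of squared forecast errors."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast))

def msfe(actual, forecast):
    """Mean squared forecast error: SSFE divided by the number of forecasts."""
    return ssfe(actual, forecast) / len(actual)

# Toy values for illustration only.
actual = [1.0, 2.0, 3.0]
forecast = [1.5, 2.0, 2.0]

print(ssfe(actual, forecast))  # 0.25 + 0.0 + 1.0 = 1.25
print(msfe(actual, forecast))
```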
To obtain the nonparametric out-of-sample forecasts, we can simply pass the edat input to the Cond_Mean method of the Reg class:

np_fcast = model.Cond_Mean(edat=income[57:72])
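To give some intuition for what a local-constant kernel forecast does at a new point, here is a minimal Nadaraya-Watson sketch in plain Python. It is not the statsmodels implementation: the function names are hypothetical, the Gaussian kernel is an assumption, and the bandwidth is fixed by hand rather than chosen by cross-validation as `bw='cv_lc'` does.

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian density, used as the kernel weight."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def local_constant_mean(x_train, y_train, x_eval, bandwidth):
    """Kernel-weighted average of y_train at each evaluation point.

    Assumes at least one training point lies close enough to each
    evaluation point for the weights not to underflow to zero.
    """
    preds = []
    for x0 in x_eval:
        weights = [gaussian_kernel((x0 - xi) / bandwidth) for xi in x_train]
        total = sum(weights)
        preds.append(sum(w * y for w, y in zip(weights, y_train)) / total)
    return preds

# Toy data for illustration only (not the ccard data).
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [2.1, 2.9, 3.6, 3.9, 4.0]
x_new = [2.5, 4.5]

fcast = local_constant_mean(x_train, y_train, x_new, bandwidth=0.8)
print(fcast)
```

Each forecast is a weighted average of the training responses, so it always lies between the smallest and largest observed value of the response.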

The SSFE with OLS is 12.09, while in the nonparametric case it is only 3.416.

The MSFE with OLS is 0.81, while in the nonparametric case it is considerably lower: 0.23.
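As a quick sanity check, the reported MSFE figures should equal the SSFE figures divided by the 15 held-out observations. The few lines below only redo that arithmetic on the numbers quoted above.

```python
n_forecast = 72 - 57  # 15 held-out observations

# (SSFE, MSFE) pairs as reported in the post.
reported = {"OLS": (12.09, 0.81), "nonparametric": (3.416, 0.23)}

implied = {}
for name, (ssfe_value, msfe_value) in reported.items():
    # MSFE implied by the reported SSFE, rounded to two decimals.
    implied[name] = round(ssfe_value / n_forecast, 2)
    print(name, "implied MSFE:", implied[name], "reported MSFE:", msfe_value)
```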