Tuesday, August 21, 2012

End of GSoC

Now that the Google Summer of Code is officially over, I decided to spend some time in summarizing the work done over the summer months. Most of the goals that I set out to accomplish in my proposal: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/gpanterov/1 have been completed. In addition we added some new features, while some of the originally planned features received less attention.

However, I feel that the main goal has been achieved: statsmodels now has the ability to run some of the most popular nonparametric models like unconditional density estimation, conditional density estimation and nonparametric regression. In addition some initial capabilities have been included on nonparametric censored regression, semiparametric partially linear models, semiparametric single index models, nonparametric tests for functional form in a regression as well as for significance of regression coefficients.

Most models in this package come from Jeff Racine and Qi Li's text "Nonparametric Econometrics" and we owe a great deal to the authors, who did much of the heavy (theoretical) lifting.

In this post I outline the new features that have been added (or improved) since the midterm evaluations. To see the summary of the work completed by the midterm evaluations please see http://statsmodels-np.blogspot.com/2012/07/gsoc-midterm-evaluations.html

(Note that the unconditional density estimation class UKDE() has been renamed to KDE() and the conditional density estimation class CKDE() has been renamed to ConditionalKDE() for more clarity)

I) Added efficient bandwidth estimation to all nonparametric models

Nonparametric methods are computationally intensive. In fact, they can be so slow to run, even with modern computers, that they can render any real world application impossible. For example, running a regression with 3-4 variables and using the least squares cross validation (cv_ls) method for bandwidth selection on a sample of only 10,000 observation could take hours or even days. Furthermore the computational time can increase exponentially with increasing the number of observations. This is why any method that reduces the time required to run the models will be highly desirable.

We included an "efficient" bandwidth estimation method to each class of nonparametric models that helps speed up the estimation time. We implement this by breaking down the original (large) sample into smaller subsamples and performing the bandwidth estimation on them. In this, we make use of the fact that the scaling factor in the bandwidth is independent of the sample. This means that by aggregating all the scaling factors for all subsamples we can compute the scaling factor of the full sample. In practice, we take the median (or mean) of all scaling factors for the subsamples estimation and use it in the full sample.

For example, when estimating the unconditional density of two variables X and Y:
dens = nparam.KDE (tdat=[X, Y], var_type='cc', bw='cv_ls', defaults=SetDefaults(efficient=True))

the defaults input parameter is an instance of the "helper" class SetDefaults, the controls the parameters for the efficient bandwidth estimation. The class SetDefaults() includes the following attributes (and some other not discussed here. See docstrings for more details):

efficient: indicate whether to perform efficient bandwidth estimation or not. Default is False
n_sub: size of the subsamples. Default is 50
n_res: number of resamples of size n_sub on which to run the bandwidth estimation. Default is 50

This speed up arises from the fact that reducing the sample size by half can increase the speed by a factor of 4. 

II) Additions to the nonparametric regression class

Although the basic functionality of the Reg() class were developed by the midterm evaluation deadline, we added several new features and improved old ones in the second half of the summer.
In particular, we added the AIC Hurvich procedure for bandwidth selection. Before that, the only option available to the user was the least squares cross validation (cv_ls). The AIC feature has good theoretical properties and it is a nice alternative if cv_ls fails to converge to a plausible bandwidth.

We also improved the local linear estimator and the local constant estimator for the nonparametric regression. Now, the model fit() of the class Reg() produces the correct marginal effects for all variables in the regression.

We also added the method sig_test() (significance tests are discussed in more detail later) that can test the significance of a single variable or jointly several variables if they are significant in the regression. For example:

model = nparam.Reg(tydat=[Y], txdat=[X, Z], var_type='cc', reg_type='ll', bw='aic')

This will estimate the following nonparametric regression y = g(x, z) + e by using the local linear estimator (reg_type ='ll') and the AIC Hurvich bandwidth selection method (bw='aic').
If we wish to test if variable X is significant in this regression (i.e. it does affect the dependent variable Y) we can run:

model.sig_test([1])

which will report if the null hypothesis that dY/dX = 0 is rejected in favor of the alternative: dY/dX != 0.

III) Developed the semiparametric partially linear model

In many practical regression applications we may reasonably believe that certain number of the covariates enters in the regression linearly, while we don't know (or don't want to assume) the functional form with which the rest of the covariates impact the dependent variable. For example, we might believe that the data generating process looks like:

y = b'*X + g(Z) + e 

where we know that X enters the regression linearly but we don't know how Z enters the regression. If we estimate y = g(X,Z) + e would be an "overkill" and it will fail to incorporate all the information about the model. We can implement this model in statsmodels by:

model = nparam.SemiLinear(tydat=[y], txdat=[x1, x2,...,xK], tzdat=[z1, z2, ..., zN], reg_type='lc', var_type='cc..o..u..c', bw='cv_ls')

The class SemiLinear() has the same methods and attributes as the Reg with the difference that the fit() method produces the marginal effects for the nonparametric elements (Z) variables dY/dZ. The class also has a b attribute which holds the vector of coefficients for the linear variables in the model -- X

IV) Developed the semiparametric single index model

Often we will know something about the data generating process but not the whole story. For example, we might know that the data X enters linearly in the function g() but we don't know the functional form of g(). For example, in many binary choice models, we assume that P(y=1) = g(b'*X) . In the usual parametric framework we assume that g() is either a logistic or the cdf of the normal distribution. In a nonparametric setting we only assume that the input of the function g() is b'*X and we don't make any assumptions about g() but rather we estimate it with the nonparametric kernel methods. This model is semiparametric because we do impose some structure, i.e. the way X enters into g(), but not the whole story. We can estimate this model by

model = nparam.SemiIndexModel(tydat=[y], txdat=[x1, x2, ...,xN], var_type='cc..o...u...c', reg_type='lc', bw='cv_ls', defaults=SetDefaults(efficient=True))

print "The coefficients are ", model.b

As was the case with the semiparametric partially linear model this class has an attriube b which has the estimate of the coefficients b.  In this case model.fit() will produce the marginal effects with respect to both b and X i.e. dy/d(b'*X)

V) Developed some (initial) censored regression capabilities

Often, when working with social science data, the variables of interest are truncated or censored in a particular way. For example, suppose that we want to model income in poor neighborhoods. Government statistics often group incomes under some level together. For example "under $10,000". However, within this category there are many households that have different levels of income. We call this type of variable censored from the left. If we don't account for this censoring, the marginal effects in the regression will be incorrectly skewed because all observations in "under $10,000" are treated as if they are clumped together.

In order to run a nonparametric censored regression model, where the dependent variable y is is for example income, we can do the following:

model = nparam.CensoredReg(tydat=[y], txdat[x1, x2, ..., xK], var_type='cc...o..u..c', censor_val=10,000)

censor_val provides the value at which the dependent variable tydat is left-censored. The methods and attributes for the class CensoredReg() are the same as those for the class Reg(). For example one can obtain the fitted values and the marginal effects evaluated at X':

mean, mfx = model.fit(edat=[x1', x2', ..., xK'])

The current functionality limits the regression type to local linear (reg_type='ll') and to left-censored dependent variables.

This class can be improved on significantly. It was originally written as part of the Reg() class but we decided to remove it from the generic nonparametric regression.

VI) Added nonparametric significance tests for the variables in a nonparametric regression

The algorithms for estimating the significance of continuous and discrete variables are different. Therefore we added to additional classes TestRegCoefC() and TestRegCoefD() for testing continuous and discrete variables in a regression respectively.
Both classes take as inputs model, which is an instance of the Reg() class (this is the nonparametric regression, whose variables we want to test) and var_pos, which is the position of the variable(s) to be tested in model.txdat. For example,

reg_model = nparam.Reg(tydat=[y], txdat=[x1, x2, x3], var_type='cco', bw='aic', reg_type='ll')

will estimate the local linear regression y = g(x1, x2, x3) + e where x1 and x2 are continuous and x3 is discrete order variable. To test if x1 is significant we can do the following:

test_c = nparam.TestRegCoefC(model=reg_model, var_pos =[0])
print "The significance level is ", test_c.sig

Alternatively, to test the discrete variable x3 for significance we can:

test_d = nparam.TestRegCoefD(model=reg_model, var_pos=[2])
print "The significance level is ", test_d.sig

The TestRegCoefC and TestRegCoefD classes have the attribute sig which is a string that takes the following values:

"Not Significant": Fails to reject the null hypothesis
"*": Significant at the 90% significance level

"**": Significant at the 95% significance level
"***": Significant at the 99% significance level

A better way to test for significance is through the method sig_test() of the nonparametric regression class Reg() which was already discussed in part II). One advantage is that the method automatically determines if the variable is continuous or discrete and uses the appropriate class.

Both tests work in similar ways. They estimate a statistic T, that is somehow directly related to the marginal effects dy/dxN. Then they start drawing bootstrap samples {Y*,X*} that impose the null hypothesis (dy/dxN = 0) and estimate the same statistic for each sample [T*_1, T*_2, .... T*_b] (for b samples). If the original test statistic T is larger than the (1-a)th quantile of [T*_1, T*_2, .... T*_b] then we reject the null correctly (1-a)% of the times.


Currently, TestCoefC() allows testing joint hypothesis of continuous variables. For example:

model = nparam.TestCoefC(model=reg_model, var_pos=[0,1])

will test the joint null hypothesis that variables in position 0 and 1 both have no effect on the dependent variable versus the alternative that at least one of them does. This is similar to the popular F-test in the linear regression framework.

Currently, testing of joint hypothesis only works for continuous variables. Jointly testing discrete variables or mixed variables is not allowed.

VII) Added the nonparametric test for functional form

Parametric tests for functional form can be very restrictive because of their requirement to specify a specific alternative hypothesis. This means that the tests have power only in the direction of the alternative hypothesis, which may or may not be correct. For example, testing for functional form,  the null hypothesis could be something like

H0: E[y|x1, x2] = a + b1*x1 + b2*x2 + e

and the alternative:

Ha: E[y|x1, x2] = a + b1*x1 + b2*x2 + b3*log(x2) + e

If we reject the null in the direction of the alternative, this doesn't necessarily mean that the this alternative hypothesis is correct. Perhaps there is another Ha' which is the true one. 

Nonparametric tests for functional form don't suffer from this problem because they test the null

H0: E[y|X] = m(x,b)

versus the alternative

Ha: E[y|X] = g(x) != m(x,b)

This nonparametric test is implemented through the TestFForm() class.The following example illustrates how to implement the nonparametric test for functional form. Suppose that we want to test if y = x*b + e, i.e the functional dependence between y and x is linear.

import statsmodels.api as sm
import statsmodels.nonparametric as nparam

lin_fform = lambda x, b: x*b  # Linear functional form
lstsq_estimator = lambda y, x: sm.OLS(y,x).fit().params  # Least sq. estimator for beta

test = nparam.TestFForm(tydat=[y], txdat=[x], var_type='c', bw='cv_ls', fform=lin_fform, estimator=lstsq_estimator)

print "The significance level is ", test.sig

where fform is the functional form to be tested and estimator is a function that returns the coefficients (marginal effects) for the model. Since this is a nonparametric test, which involves kernel estimation of g(x), we also need to specify the bandwidth selection procedure and the variable type in txdat. Ideally estimator will be the Non-linear least square estimator (NLLS).

VIII) Summary

Here is a quick list of all the features added since the midterm evaluations:
  • AIC Hurvich bandwidth selection procedure for nonparametric regression
  • Marginal effects for local linear and local constant estimator in nonparametric regression for both continuous and discrete variables
  • Efficient bandwidth estimation procedures by breaking down the data into smaller parts. This significantly reduces the computational time required for the data-driven bandwidth selection procedures
  • Significance tests for nonparametric regression -- Reg().sig_test(var_pos)
  • Semiparametric partially linear model -- SemiLinear()
  • Semiparametric single index model -- SemiIndexModel()
  • Nonparametric censored regression model -- CensoredReg()
  • Nonparametric tests for continuous variables in a nonparametric regression  -- TestRegCoefC()
  • Nonparametric tests for discrete variables in a nonparametric regression -- TestRegCoefD()
  • Nonparametric tests for function form -- TestFForm()
  • Tests and examples for the most prominent features of the models

IX) Areas for future work and improvement

Naturally there are many parts of this code where improvements could be made. I will try to summarize some of the most important areas, that in my opinion should be improved in the future.

The speed of the data-driven bandwidth selection procedures is still quite low. For some models, R's np package converges more quickly than statsmodels. Therefore, some code optimization for speed should be on the top of our wishlist. For example, some of the for-loops (e.g. when estimating leave-one-out likelihoods) could be made parallel with the help of libraries such as joblib. Furthermore, some parts of the code (the kernels?) could be re-written in Cython to increase their speed.

The censored regression class CensoredReg() could be modified so that it can handle both left and right censored data.

The nonparametric functional form test TestFForm() should use as the default value for the estimator input a NLLS. This is currently under development in statsmodels so as soon as this is done this class should be revisited and tested. 

More tests for all features of all models are always on the wishlist.