Tuesday, August 21, 2012

End of GSoC

Now that the Google Summer of Code is officially over, I decided to spend some time in summarizing the work done over the summer months. Most of the goals that I set out to accomplish in my proposal: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/gpanterov/1 have been completed. In addition we added some new features, while some of the originally planned features received less attention.

However, I feel that the main goal has been achieved: statsmodels now has the ability to run some of the most popular nonparametric models like unconditional density estimation, conditional density estimation and nonparametric regression. In addition some initial capabilities have been included on nonparametric censored regression, semiparametric partially linear models, semiparametric single index models, nonparametric tests for functional form in a regression as well as for significance of regression coefficients.

Most models in this package come from Jeff Racine and Qi Li's text "Nonparametric Econometrics" and we owe a great deal to the authors, who did much of the heavy (theoretical) lifting.

In this post I outline the new features that have been added (or improved) since the midterm evaluations. To see the summary of the work completed by the midterm evaluations please see http://statsmodels-np.blogspot.com/2012/07/gsoc-midterm-evaluations.html

(Note that the unconditional density estimation class UKDE() has been renamed to KDE() and the conditional density estimation class CKDE() has been renamed to ConditionalKDE() for more clarity)

I) Added efficient bandwidth estimation to all nonparametric models

Nonparametric methods are computationally intensive. In fact, they can be so slow to run, even with modern computers, that they can render any real world application impossible. For example, running a regression with 3-4 variables and using the least squares cross validation (cv_ls) method for bandwidth selection on a sample of only 10,000 observation could take hours or even days. Furthermore the computational time can increase exponentially with increasing the number of observations. This is why any method that reduces the time required to run the models will be highly desirable.

We included an "efficient" bandwidth estimation method to each class of nonparametric models that helps speed up the estimation time. We implement this by breaking down the original (large) sample into smaller subsamples and performing the bandwidth estimation on them. In this, we make use of the fact that the scaling factor in the bandwidth is independent of the sample. This means that by aggregating all the scaling factors for all subsamples we can compute the scaling factor of the full sample. In practice, we take the median (or mean) of all scaling factors for the subsamples estimation and use it in the full sample.

For example, when estimating the unconditional density of two variables X and Y:
dens = nparam.KDE (tdat=[X, Y], var_type='cc', bw='cv_ls', defaults=SetDefaults(efficient=True))

the defaults input parameter is an instance of the "helper" class SetDefaults, the controls the parameters for the efficient bandwidth estimation. The class SetDefaults() includes the following attributes (and some other not discussed here. See docstrings for more details):

efficient: indicate whether to perform efficient bandwidth estimation or not. Default is False
n_sub: size of the subsamples. Default is 50
n_res: number of resamples of size n_sub on which to run the bandwidth estimation. Default is 50

This speed up arises from the fact that reducing the sample size by half can increase the speed by a factor of 4. 

II) Additions to the nonparametric regression class

Although the basic functionality of the Reg() class were developed by the midterm evaluation deadline, we added several new features and improved old ones in the second half of the summer.
In particular, we added the AIC Hurvich procedure for bandwidth selection. Before that, the only option available to the user was the least squares cross validation (cv_ls). The AIC feature has good theoretical properties and it is a nice alternative if cv_ls fails to converge to a plausible bandwidth.

We also improved the local linear estimator and the local constant estimator for the nonparametric regression. Now, the model fit() of the class Reg() produces the correct marginal effects for all variables in the regression.

We also added the method sig_test() (significance tests are discussed in more detail later) that can test the significance of a single variable or jointly several variables if they are significant in the regression. For example:

model = nparam.Reg(tydat=[Y], txdat=[X, Z], var_type='cc', reg_type='ll', bw='aic')

This will estimate the following nonparametric regression y = g(x, z) + e by using the local linear estimator (reg_type ='ll') and the AIC Hurvich bandwidth selection method (bw='aic').
If we wish to test if variable X is significant in this regression (i.e. it does affect the dependent variable Y) we can run:


which will report if the null hypothesis that dY/dX = 0 is rejected in favor of the alternative: dY/dX != 0.

III) Developed the semiparametric partially linear model

In many practical regression applications we may reasonably believe that certain number of the covariates enters in the regression linearly, while we don't know (or don't want to assume) the functional form with which the rest of the covariates impact the dependent variable. For example, we might believe that the data generating process looks like:

y = b'*X + g(Z) + e 

where we know that X enters the regression linearly but we don't know how Z enters the regression. If we estimate y = g(X,Z) + e would be an "overkill" and it will fail to incorporate all the information about the model. We can implement this model in statsmodels by:

model = nparam.SemiLinear(tydat=[y], txdat=[x1, x2,...,xK], tzdat=[z1, z2, ..., zN], reg_type='lc', var_type='cc..o..u..c', bw='cv_ls')

The class SemiLinear() has the same methods and attributes as the Reg with the difference that the fit() method produces the marginal effects for the nonparametric elements (Z) variables dY/dZ. The class also has a b attribute which holds the vector of coefficients for the linear variables in the model -- X

IV) Developed the semiparametric single index model

Often we will know something about the data generating process but not the whole story. For example, we might know that the data X enters linearly in the function g() but we don't know the functional form of g(). For example, in many binary choice models, we assume that P(y=1) = g(b'*X) . In the usual parametric framework we assume that g() is either a logistic or the cdf of the normal distribution. In a nonparametric setting we only assume that the input of the function g() is b'*X and we don't make any assumptions about g() but rather we estimate it with the nonparametric kernel methods. This model is semiparametric because we do impose some structure, i.e. the way X enters into g(), but not the whole story. We can estimate this model by

model = nparam.SemiIndexModel(tydat=[y], txdat=[x1, x2, ...,xN], var_type='cc..o...u...c', reg_type='lc', bw='cv_ls', defaults=SetDefaults(efficient=True))

print "The coefficients are ", model.b

As was the case with the semiparametric partially linear model this class has an attriube b which has the estimate of the coefficients b.  In this case model.fit() will produce the marginal effects with respect to both b and X i.e. dy/d(b'*X)

V) Developed some (initial) censored regression capabilities

Often, when working with social science data, the variables of interest are truncated or censored in a particular way. For example, suppose that we want to model income in poor neighborhoods. Government statistics often group incomes under some level together. For example "under $10,000". However, within this category there are many households that have different levels of income. We call this type of variable censored from the left. If we don't account for this censoring, the marginal effects in the regression will be incorrectly skewed because all observations in "under $10,000" are treated as if they are clumped together.

In order to run a nonparametric censored regression model, where the dependent variable y is is for example income, we can do the following:

model = nparam.CensoredReg(tydat=[y], txdat[x1, x2, ..., xK], var_type='cc...o..u..c', censor_val=10,000)

censor_val provides the value at which the dependent variable tydat is left-censored. The methods and attributes for the class CensoredReg() are the same as those for the class Reg(). For example one can obtain the fitted values and the marginal effects evaluated at X':

mean, mfx = model.fit(edat=[x1', x2', ..., xK'])

The current functionality limits the regression type to local linear (reg_type='ll') and to left-censored dependent variables.

This class can be improved on significantly. It was originally written as part of the Reg() class but we decided to remove it from the generic nonparametric regression.

VI) Added nonparametric significance tests for the variables in a nonparametric regression

The algorithms for estimating the significance of continuous and discrete variables are different. Therefore we added to additional classes TestRegCoefC() and TestRegCoefD() for testing continuous and discrete variables in a regression respectively.
Both classes take as inputs model, which is an instance of the Reg() class (this is the nonparametric regression, whose variables we want to test) and var_pos, which is the position of the variable(s) to be tested in model.txdat. For example,

reg_model = nparam.Reg(tydat=[y], txdat=[x1, x2, x3], var_type='cco', bw='aic', reg_type='ll')

will estimate the local linear regression y = g(x1, x2, x3) + e where x1 and x2 are continuous and x3 is discrete order variable. To test if x1 is significant we can do the following:

test_c = nparam.TestRegCoefC(model=reg_model, var_pos =[0])
print "The significance level is ", test_c.sig

Alternatively, to test the discrete variable x3 for significance we can:

test_d = nparam.TestRegCoefD(model=reg_model, var_pos=[2])
print "The significance level is ", test_d.sig

The TestRegCoefC and TestRegCoefD classes have the attribute sig which is a string that takes the following values:

"Not Significant": Fails to reject the null hypothesis
"*": Significant at the 90% significance level

"**": Significant at the 95% significance level
"***": Significant at the 99% significance level

A better way to test for significance is through the method sig_test() of the nonparametric regression class Reg() which was already discussed in part II). One advantage is that the method automatically determines if the variable is continuous or discrete and uses the appropriate class.

Both tests work in similar ways. They estimate a statistic T, that is somehow directly related to the marginal effects dy/dxN. Then they start drawing bootstrap samples {Y*,X*} that impose the null hypothesis (dy/dxN = 0) and estimate the same statistic for each sample [T*_1, T*_2, .... T*_b] (for b samples). If the original test statistic T is larger than the (1-a)th quantile of [T*_1, T*_2, .... T*_b] then we reject the null correctly (1-a)% of the times.

Currently, TestCoefC() allows testing joint hypothesis of continuous variables. For example:

model = nparam.TestCoefC(model=reg_model, var_pos=[0,1])

will test the joint null hypothesis that variables in position 0 and 1 both have no effect on the dependent variable versus the alternative that at least one of them does. This is similar to the popular F-test in the linear regression framework.

Currently, testing of joint hypothesis only works for continuous variables. Jointly testing discrete variables or mixed variables is not allowed.

VII) Added the nonparametric test for functional form

Parametric tests for functional form can be very restrictive because of their requirement to specify a specific alternative hypothesis. This means that the tests have power only in the direction of the alternative hypothesis, which may or may not be correct. For example, testing for functional form,  the null hypothesis could be something like

H0: E[y|x1, x2] = a + b1*x1 + b2*x2 + e

and the alternative:

Ha: E[y|x1, x2] = a + b1*x1 + b2*x2 + b3*log(x2) + e

If we reject the null in the direction of the alternative, this doesn't necessarily mean that the this alternative hypothesis is correct. Perhaps there is another Ha' which is the true one. 

Nonparametric tests for functional form don't suffer from this problem because they test the null

H0: E[y|X] = m(x,b)

versus the alternative

Ha: E[y|X] = g(x) != m(x,b)

This nonparametric test is implemented through the TestFForm() class.The following example illustrates how to implement the nonparametric test for functional form. Suppose that we want to test if y = x*b + e, i.e the functional dependence between y and x is linear.

import statsmodels.api as sm
import statsmodels.nonparametric as nparam

lin_fform = lambda x, b: x*b  # Linear functional form
lstsq_estimator = lambda y, x: sm.OLS(y,x).fit().params  # Least sq. estimator for beta

test = nparam.TestFForm(tydat=[y], txdat=[x], var_type='c', bw='cv_ls', fform=lin_fform, estimator=lstsq_estimator)

print "The significance level is ", test.sig

where fform is the functional form to be tested and estimator is a function that returns the coefficients (marginal effects) for the model. Since this is a nonparametric test, which involves kernel estimation of g(x), we also need to specify the bandwidth selection procedure and the variable type in txdat. Ideally estimator will be the Non-linear least square estimator (NLLS).

VIII) Summary

Here is a quick list of all the features added since the midterm evaluations:
  • AIC Hurvich bandwidth selection procedure for nonparametric regression
  • Marginal effects for local linear and local constant estimator in nonparametric regression for both continuous and discrete variables
  • Efficient bandwidth estimation procedures by breaking down the data into smaller parts. This significantly reduces the computational time required for the data-driven bandwidth selection procedures
  • Significance tests for nonparametric regression -- Reg().sig_test(var_pos)
  • Semiparametric partially linear model -- SemiLinear()
  • Semiparametric single index model -- SemiIndexModel()
  • Nonparametric censored regression model -- CensoredReg()
  • Nonparametric tests for continuous variables in a nonparametric regression  -- TestRegCoefC()
  • Nonparametric tests for discrete variables in a nonparametric regression -- TestRegCoefD()
  • Nonparametric tests for function form -- TestFForm()
  • Tests and examples for the most prominent features of the models

IX) Areas for future work and improvement

Naturally there are many parts of this code where improvements could be made. I will try to summarize some of the most important areas, that in my opinion should be improved in the future.

The speed of the data-driven bandwidth selection procedures is still quite low. For some models, R's np package converges more quickly than statsmodels. Therefore, some code optimization for speed should be on the top of our wishlist. For example, some of the for-loops (e.g. when estimating leave-one-out likelihoods) could be made parallel with the help of libraries such as joblib. Furthermore, some parts of the code (the kernels?) could be re-written in Cython to increase their speed.

The censored regression class CensoredReg() could be modified so that it can handle both left and right censored data.

The nonparametric functional form test TestFForm() should use as the default value for the estimator input a NLLS. This is currently under development in statsmodels so as soon as this is done this class should be revisited and tested. 

More tests for all features of all models are always on the wishlist. 

Thursday, July 12, 2012

GSoC Midterm Evaluations

My goal for the GSoC was to add to statsmodels the capability to estimate multivariate, mixed data, unconditional and conditional densities with nonparametric methods; nonparametric regressions and popular nonparametric models.
The goal for the first half of the GSoC was to create the building blocks of a nonparametric estimation library. These are multivariate unconditional density estimation, multivariate conditional density estimation and nonparametric regression. Many of the models to come are heavily dependent on these three classes. Most of the work that I set out to do in my GSoC application has been completed. However, there are still several missing parts that need to be completed and/or optimized.

The challenge when coding nonparametric estimators is twofold. On the one hand you  have to make sure that the estimators are correctly implemented and yield the right results, but at the same time you need to program them efficiently. This is because many of the nonparametric methods are very computationally intensive and often the time they need to converge is prohibitively long.

Here is an overview of the completed parts:

I) Unconditional Multivariate Density Estimation

This is the fundamental  class in the nonparametric estimation. The goal is to calculate the joint probability density of many variables that can be of mixed types (continuous, discrete ordered and discrete unordered): P(x1, x2, ... , xN). This can be implemented with UKDE class (acronym for Unconditional Kernel Density Estimation):

UKDE (tdat, var_type, bw)

tdat is a list of 1D arrays comprising of the training data
var_type is a string that specifies the variable type (e.g. 'ccouc' specifies that the variables are of type continuous, continuous, ordered, unordered and continuous in that order)
bw is either the user-specified bandwidth in which case it is a 1D vector of bandwidth values or it can be a string that specifies the bandwidth selection method. The possible methods are:

'cv_ls' for cross-validation least squares
'cv_ml' for cross-validation maximum likelihood
'normal_reference' for the Scott's rule of thumb (aka normal reference rule of thumb)

The UKDE class has the following two methods


The pdf(edat) method calculates the probability density function at the evaluation points edat. edat is NxK where K is the number of variables in tdat and N is the number of evaluation (sets of) points. If edat is not specified by the user then the pdf is estimated at the training data tdat.

The structure of UKDE.cdf(edat) is similar. The difference is that it returns the cumulative distribution function evaluated at edat: P(X<=x).

All components of UKDE (pdf, cdf, bw methods etc.) are tested in test_nonparametric2.py (with both real and generated data) and all results are identical (or very close) to those reported by the np package in R written by Jeff Racine (whose text and papers are the primary reference for the discussed methods). A few univariate graphical examples for the nonparametric estimation of popular distributions (f, Poisson, Pareto, Laplace etc.) are presented in ex_univar_kde.py.

The convergence time for the bandwidth selection methods has been tested and is consistent with the theoretical results.

II) Conditional Multivariate Density Estimation

The density of a set of variables Y=(y1,y2,...,yP), conditional on X=(x1,x2,...xQ) is denoted f(Y|X) and equals the ration of the joint dnesity and the marginal density of X or f(Y,X)/f(X).
One difficulty that arises in conditional estimation is that the procedure for the bandwidth selection in the conditional case is different than the one in the unconditional case which prevents simple recycling of the code in UKDE.

Conditional density estimation is implemented with the class CKDE (acronym for Conditional Kernel Density Estimation)
CKDE(tydat, txdat, dep_type, indep_type, bw)

tydat is the dependent variable(s)
txdat is the independent variable(s)
dep_type and indep_type are strings specifying the types of the dependent and independent varaibles
bw is either user-specified bandwidth values or bandwidth selection method (similar as in UKDE)

CKDE also has the option of calculating the PDF and CDF of the density with the two methods
CKDE.pdf(eydat, exdat) and CKDE.cdf(eydat, exdat)
if left unspecified eydat and exdat default to tydat and txdat

Note: currently eydat needs to be N x (P+Q) and exdat is NxQ. To be fixed soon.

The functionality of the CKDE including the pdf, cdf and bandwidth estimates have been cross-checked with the np package and produce identical or very similar results. All tests are in test_nonparametric2.py.
The cross-validation least squares bandwidth selection method is still a little faster in R so some optimization of the code for speed could be required.
The convergence time for the bandwidth selection methods has been tested and is consistent with the theoretical results.

III) Nonparametric Regression

The model is y = g(X) + e
A regression is an estimation of E[y|X]. When the functional form of g(X) is not specified it can be estimated with kernel methods.

Nonparametric regression can be implemented with the class Reg:
Reg(tydat, txdat, var_type, ret_type, bw)
tydat is the dependent (left-hand-side variable)
txdat is a list of independent variables
var_type is the independent variable type (txdat)
reg_type is the type of estimator. It can take on two values: 
    'lc' - local constant (default)
    'll' - local linear estimator
bw: is either the user-specified vector of bandwidth values or a bandwidth selection method:
    'cv_ls' - cross validation least squares

Reg.mean(edat) returns the conditional mean E[y|X=edat]. If edat is not specified it defaults to the training data txdat.
Reg.r-squared() returns the R-Squared of the model (the model fit)

The bandwidth selection methods, R-Squared and conditional mean have been tested and the results are identical or very similar to those produced by the np package in R. 

IV) List of completed elements

  • Gaussian kernel (for continuous variables)
  • Aitchison-Aitken kernel (for unordered discrete variables)
  • Wang-van Ryzin kernel (for ordered discrete variables)
  • Epanechnikov kernel -- has been dropped from KernelFunctions.py (can be added again at some point if we decide to give the user the ability to specify kernels. However, the literature is unanimous that the kernel type is of small importance to the density estimation)
  • "Convolution kernels" for all three types of kernels (used in the conditional pdf and cdf estimation
  • "CDF" kernels for all three types of kernels (the integrals of the kernels) used for the fast estimation of conditional and unconditional cumulative distribution
Density Estimation
  • Normal reference rule of thumb bandwidth selection method (Scott's) for both UKDE and CKDE
  • Cross-validation least squares for both UKDE and CKDE for mixed data types
  • Cross-validation maximum likelihood for both UKDE and CKDE for mixed data types
  • Probability density function (PDF) for both UKDE and CKDE for mixed data types
  • Cumulative distribution function (CDF) for both UKDE and CKDE for mixed data types
Nonparametric Regression
  • Local constant estimator
  • Local linear estimator
  • Cross validation least squares bandwidth selection method for both estimators
  • Conditional mean for both estimation
  • R-Squared (fit for the model)
  • Tests for all density and regression methods and bandwidth selection procedures
  • Graphical example with popular distributions in ex_univar_kde.py
  • Docstrings with references, LaTeX formulas and (some) examples
The plug-in bandwidth selection methods have been dropped because of their lack of robustness in a multivariate setting and inferiority to the data-driven methods like cv_ls and cv_ml.

V) Still to do:

  • Add marginal effects for regression
  • Add significance tests for regression
  • Add AIC bandwidth selection method to regression class
(I am currently working on these and they might be ready by the end of the week. )

Wednesday, July 11, 2012

KDE estimate of Pareto RV

support = np.random.pareto(a, size = N)
rv = stats.pareto(a)
ix = np.argsort(support)

dens_normal = nparam.UKDE(tdat=[support], var_type='c', bw='normal_reference')
dens_cvls = nparam.UKDE(tdat=[support], var_type='c', bw='cv_ls')
dens_cvml = nparam.UKDE(tdat=[support], var_type='c', bw='cv_ml')
plt.plot(support[ix],rv.pdf(support[ix]), label='Actual')
plt.plot(support[ix],dens_cvls.pdf()[ix], label='CV_LS')
plt.plot(support[ix],dens_cvml.pdf()[ix], label='CV_ML')
plt.title("Nonparametric Estimation of the Density of Pareto Distributed Random Variable")

Out-of-sample performance of OLS vs. Nonparametric Regression

The true usefulness of a regression estimator should be judged by its out-of-sample performance. It is easy to achieve a perfect fit for a finite sample by fitting an N-order polynomial to the data. However, the estimator will then suffer by "high-variance" or poor out-of-sample forecasting accuracy.

Let's build on the example from the previous post http://statsmodels-np.blogspot.com/2012/07/nonparametric-regression.html

To summarize we are using the ccard dataset in statsmodels which has data on the average credit card expenditures of 72 individuals. We are interested in studying the relationship between income and credit card expenditure. We saw that, while a linear relationship is inadequate a quadratic relationship provides a good fit for the data.

Let's use the first 57 observations to train our estimates. To get the kernel regression estimate in statsmodels:

model = nparam.Reg(tydat=[np.log(avgexp[0:57])], txdat=[income[0:57]], var_type='c', bw='cv_lc')

We obtain the least-squares coefficients for the model Log(AvgExp) = a + b1*income +b2*income^2 + e to be:
a = 3.11, b1= 0.72, and b2= - 0.04

We then use the estimates from both models to forecast the remaining 15 observations. The accuracy of the forecast is judged by sum of the squared forecast error (SSFE) and the Mean Squared Forecast Error (MSFE).
To obtain the nonparametric out-of-sample forecasts we can simply specify the edat input in the Cond_Mean attribute of the Reg class:

np_fcast = model.Cond_Mean(edat=income[57:72])

The SSFE with OLS is 12.09 while in the nonparametric case it is only 3.416.

The MSFE with OLS is 0.81, while in the nonparametric case it is significantly lower : 0.23

Nonparametric Regression


To regress a left-hand-variable y on several right-hand-side variables X = {x_1, x_2,...,x_N} means to estimate the conditional expected value (mean) E[y|X]. Another way to formulate the problem is to say that y = g(X) +e , where e is the error term or residual.

Traditionally, calculating the conditional mean is done by minimizing the sum of squared residuals e = y - g(X), where we assume that we know the true functional form. For example we could assume a linear relationship between X and y so that g(X) = a + b1*x_1 + ...+ bN*x_N. Then the problem is reduced to estimating the unknown coefficients a, b1, b2, ... , bN.

Although computationally efficient, this approach suffers from one major limitation. The researcher needs to specify beforehand the functional form of g(X), which is often difficult to do. Therefore, the results of the traditional estimation are only as good as the assumption of the underlying function form of the relationship between y and X. 

In a non-parametric regression, the researcher can avoid making this uncomfortable assumptions about data generating process and can estimate the conditional mean E[y|X] with minimal assumptions about the data. The non-parametric approach relies on using kernel density estimates of g(X).


For this example we will use ccard dataset in statsmodels. The data consists of the average credit card expenditures of 72 individuals along with information about their income, whether they own or rent a house and their age. Suppose we are interested in studying the effects of income on credit card spending. 

One way to proceed is parametrically by assuming that we know the functional relationship between Log(AvgExp) and income. The plot of the data suggests that a linear estimator will fit the data with high bias and maybe a quadratic relationship is more appropriate. We can use the OLS to estimate the following regression 
Log(AvgExp) = a + b1*income + b2* income^2 + e

Alternatively we can proceed non-parametrically without any strong assumptions about the functional relationship between credit card expenditure and income. To do that within statsmodels we can use the nonparametric estimators:

import statsmodels.nonparametric as nparam
model = nparam.Reg(tydat=[np.log(avgexp)], txdat=[income], var_type='c', bw='cv_lc')

where var_type specifies the type of the dependent variable (continuous in this case) and bw specifies that bandwidth regression estimator (in this case the cross validation least squares local constant estimator)

we can obtain the fitted values (the conditional mean) by
np_fitted = model.Cond_Mean()

and we can plot them against the OLS estimates:

Note that we didn't have to specify a quadratic relationship (include income^2 in the list of dependent variables) in the nonparametric approach.

We can obtain the R-Squared in the nonparametric approach:
which is 0.25

The bandwidth estimate is 0.78 (model.bw)

As usual, the nonparametric approach comes at a cost - the researcher needs more data for robust estimates. Furthermore, the data-driven bandwidth selection methods are significantly more computationally intensive. 

However a nonparametric approach will usually outperform an incorrectly specified parametric approach. For example if we assume a linear relationship between the log of average expenditure and income: Log(AvgExp) = a +b*income + e, then the sum of squared residuals for the OLS is 69.32, while for the nonparametric estimator is lower: 67.37

Wednesday, June 20, 2012

Estimating Probability Densities in Statsmodels

The nonparametric estimation in statsmodels relies on two main classes- UKDE and CKDE. Each class has attributes that store the probability density (pdf), the cumulative distribution function (cdf) and the bandwidth (bw). Currently the classes can handle mixed variable types (continuous and discrete data) and multiple bandwidth selection methods. 

UKDE implements the unconditional kernel density estimation. Suppose you would like to estimate the joint probability density of two variables, say X and Y. And suppose that X is continuous and Y is some ordered discrete variable. To do this with statsmodels you simply have to create an instance of the class UKDE:

udens = UKDE (tdat = [X, Y], var_type = 'co', bw = 'cv_ls')

tdat is the training data (in this case a list of two arrays), var_type specifies the type of variables in tdat (continuous and ordered) and bw specifies the bandwidth method to be used (in this case least squares cross validation). Now that the density has been estimated suppose you would like to calculate the probability of a particular realization of X = x and a particular Y = y. To do this:

udens.pdf (edat = [x,y] )

where edat is the evaluation data. x,y can also be arrays if the user wants to calculate the density at multiple points at the same time.

An important part of the nonparametric estimation is the calculation of the bandwidth. This is controlled by the input parameter bw. Currently the user can choose three methods: normal reference rule of thumb (bw='normal_reference'), maximum likelihood cross-validation (bw = 'cv_ml') and least squares cross-validation (bw = 'cv_ls'). Or alternatively the user can specify an array of values to be used for the bandwidth. The bandwidth estimation is stored in the bw attribute of the UKDE class. To access it:


The conditional kernel density estimation is implemented through the class CKDE. For example

cdens = CKDE (tydat = [X,Y], txdat = [V, W], dep_type = 'co', indep_type = 'cc', bw = 'cv_ml')

This will estimate the conditional probability density P (X,Y | V, W) -- the joint probability of X and Y given W and V. tydat and txdat are the dependent and independent data each of which has a variable type controlled by dep_type and indep_type. In this case the X is continous and Y is ordered while both independent variables V and W are continuous. The bandwidth selection method is maximum likelihood cross-validation which runs faster than least squares cross-validation.

To access the value of the conditional pdf for particular data x,y,v,w simply try:

cdens.pdf (eydat = [x,y], exdat = [v,w])

Sunday, June 10, 2012

A Few Words on Nonparametric Estimation

The main idea behind this Google Summer of Code project is to expand the nonparametric capabilities of statsmodels - a statistical library for Python. Nonparametric estimation requires almost no assumptions about the true distribution. We only require that the "true" distribution is smooth and differentiable.

The most well-known example of a nonparametric estimation is the simple histogram. By looking at the frequency of the data we infer characteristics of its probability distribution (normality, skewness, variance etc.). The field of nonparametric econometrics take this idea a little further by developing theoretically consistent ways of dealing with the bandwidth selection (e.g. number and width of the bins), incorporating multiple variables and estimating joint probability distributions of the type f(X_1,X_2,...X_n), working with mixed data types (continuous, ordered and unordered variables) estimating conditional densities etc.

Of course nonparametric estimation is not a "silver bullet". Not having to specify a priori assumptions about the true state of the world is a luxury and it comes at a price. A major drawback of nonparametric estimation is that consistent results require a great deal more data than the usual parametric methods. Furthermore, some of the bandwidth selection methods are computationally intensive and can take significant amounts of computational time. Working with many variables can also be challenging as one quickly runs into the "curse of dimensionality" - adding continuous variables rapidly increases the need for more data.

That being said, the growing computational capabilities of computers combined with the rapid accumulation of data from all walks of life will make the nonparametric methods an appealing inferential tool.

Thursday, April 5, 2012

GSoC Application

As we get better at collecting data and as our computational resources
continue to increase, nonparametric methods (despite the fact that
they require more data and are more computationally intensive) will
become more and more appealing to researchers. There are several
commercial packages such as Matlab and Mathematica that can currently
handle some nonparametric estimation. In addition some open source
packages like R have libraries that can handle some nonparametric
estimation [4]. The goal of this project is to develop an open-source,
Python-based alternative to these sources within statsmodels (see [1]
and [6]) which would make the package even more appealing to
practitioners and academics and hopefully make Python the primary
choice for computational work. 
The main focus of my summer work will be to expand the current
nonparametric capabilities of statsmodels [2,3] in three main
directions: develop the fully data-driven bandwidth selection methods
and improve the existing “rule-of-thumb” methods; make it possible to
handle conditional and unconditional multivariate kernel density
estimation; and work on popular nonparametric models (see the textbook
Nonparametric Econometrics by Qi Li and Jeff Racine, 2007) 
Project Schedule
Get familiar with the profiling tools for Python and organize and
familiarize with the existing code in the sandbox [3]. Look for
tutorials for optimization for speed of the data-driven methods for
bandwidth selection. 
Week 1 – 2 (May 21 – June 3)
Start work on the bandwidth selection methods. Add to the current
“rule-of-thumb” methods, fully data-driven methods such as likelihood
cross and least-squares cross validation and the Hurvich, Simonoff and
Tsai (1998) bandwidth selection method. Introduce several “plug-in”
bandwidth selection procedures for some of the more popular
distributions. This should improve the current univariate kernel
density estimation procedures in statsmodels. 
Week 2 – 4 (June 4 – June 17)
Begin work on two major classes: multivariate unconditional density
estimator and multivariate conditional density estimators. Adapt the
existing bandwidth selection procedures to handle the multivariate
density estimation. Create two more classes that will estimate the
cumulative densities in the conditional and unconditional case. 
Week 4 – 6 (June 18 – July 1)
Develop a class that fits nonparametric regression models of the type
y=g(x)+e, where x is multivariate, and implements the local constant
kernel estimator and the local linear kernel estimator proposed by
Stone(1977) and Cleveland (1979) with appropriate significance tests
and marginal effects 
Week 6 – 8 (July 2 – July 15)
Midterm (July 13) . The work between week 1 and week 6 will form the
backbone of the models to come. Code the appropriate tests for the
conditional, unconditional density estimators and the nonparametric
regression. Cross-check results with the nonparametric package “np”
written for R and make sure all computational methods are working
properly [4]. 
Week 8 – 10 (July 16 – July 29)
Begin work on extending the model library. Write two classes that can
fit semiparametric Tobit models and semiparametric censored regression
Week 10 – 12 (July 30 – August 12)
Explore the feasibility of including more advanced models such as
nonparametric simultaneous equation models and nonparametric panel
data models. Check if there is existing code that overlaps and start
the groundwork. These should overlap with the current existing
capabilities of statsmodels [1]. Begin work on the documentation for
the models and start writing tests for the nonparametric models
developed in the second half of the summer. Compare results with
other existing packages. 
Week 12 -  (August 13 - )
Polish up and improve any remaining issues with the code. Ensure that
any issues with the documentation are complete. 
About me
I am currently completing my 9-th semester in the economics PhD
program at American University in Washington, DC. I have completed all
my required course work and I am currently doing research for my
dissertation which is focused on the implementation of nonparametric
and information theoretic methods to continuous double auction
financial markets. A substantial part of my work is on applying kernel-
based methods to study the dynamics of asset returns conditional on
microstructure variables such as order flow and order book
characteristics. Some of my Python code on nonparametric estimation,
time series, microstructure and genetic algorithms is publicly
available [5]. Currently, I have been using R’s nonparametric package
“np” through rpy2 and rpy but this has its drawbacks. I would like to
help develop the nonparametric capabilities of statsmodels and
contribute to the drive to make python the primary choice for
computational work of academics and professionals alike. 
Contact info
Name: George Panterov
Project Blog: http://statsmodels-np.blogspot.com/
Project  Wiki: https://github.com/statsmodels/statsmodels/wiki/GSoC-Ideas 

GSoC 2012 application

This is the first post as part of my GSoC 2012 application