Thursday, July 12, 2012

GSoC Midterm Evaluations

My goal for the GSoC was to add to statsmodels the capability to estimate multivariate, mixed-data, unconditional and conditional densities with nonparametric methods, along with nonparametric regression and popular nonparametric models.
The goal for the first half of the GSoC was to create the building blocks of a nonparametric estimation library. These are multivariate unconditional density estimation, multivariate conditional density estimation and nonparametric regression. Many of the models to come are heavily dependent on these three classes. Most of the work that I set out to do in my GSoC application has been completed. However, there are still several missing parts that need to be completed and/or optimized.

The challenge when coding nonparametric estimators is twofold. On the one hand, you have to make sure that the estimators are correctly implemented and yield the right results; at the same time, you need to program them efficiently, because many nonparametric methods are very computationally intensive and the time they need to converge is often prohibitively long.

Here is an overview of the completed parts:

I) Unconditional Multivariate Density Estimation


This is the fundamental class in nonparametric estimation. The goal is to calculate the joint probability density of many variables that can be of mixed types (continuous, discrete ordered and discrete unordered): P(x1, x2, ... , xN). This is implemented with the UKDE class (acronym for Unconditional Kernel Density Estimation):

UKDE (tdat, var_type, bw)

where:
tdat is a list of 1D arrays comprising the training data
var_type is a string that specifies the variable types (e.g. 'ccouc' specifies that the variables are of type continuous, continuous, ordered, unordered and continuous, in that order)
bw is either the user-specified bandwidth, in which case it is a 1D vector of bandwidth values, or a string that specifies the bandwidth selection method. The possible methods are:

'cv_ls' for cross-validation least squares
'cv_ml' for cross-validation maximum likelihood
'normal_reference' for Scott's rule of thumb (aka the normal reference rule of thumb)
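The normal reference rule has a simple closed form for continuous variables, which makes it a good illustration of bandwidth selection. The sketch below is plain NumPy, not the UKDE code, and the function name is made up; it computes a per-variable bandwidth h_k = 1.06 * sigma_k * n^(-1/(4+K)):

```python
import numpy as np

def normal_reference_bw(data):
    """Scott's normal reference rule-of-thumb bandwidth for each
    continuous variable: h_k = 1.06 * sigma_k * n**(-1/(4 + K)).

    A sketch of the rule described above, not the UKDE implementation.
    """
    data = np.asarray(data, dtype=float)  # shape (n, K)
    n, K = data.shape
    sigma = data.std(axis=0, ddof=1)      # per-variable sample std. deviation
    return 1.06 * sigma * n ** (-1.0 / (4 + K))

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
bw = normal_reference_bw(x)               # one bandwidth per variable
```

The cross-validation methods (cv_ls, cv_ml) instead minimize a data-driven criterion numerically, which is why they are much slower but more robust to non-normal data.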

The UKDE class has the following two methods

UKDE.pdf(edat)
UKDE.cdf(edat)

The pdf(edat) method calculates the probability density function at the evaluation points edat. edat is N x K, where K is the number of variables in tdat and N is the number of evaluation points (each row is one set of values). If edat is not specified by the user, the pdf is estimated at the training data tdat.

The structure of UKDE.cdf(edat) is similar. The difference is that it returns the cumulative distribution function evaluated at edat: P(X<=x).
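The core of the pdf method is a product-kernel average over the training data. The following is a minimal sketch for continuous variables only, with a Gaussian kernel; the function name is illustrative and this is not the statsmodels implementation (which also handles the discrete kernels):

```python
import numpy as np

def kde_pdf(edat, tdat, bw):
    """Gaussian product-kernel density estimate at each row of edat:

        f_hat(x) = (1/n) * sum_i prod_k (1/h_k) K((x_k - X_ik) / h_k)

    Continuous variables only -- a sketch of the idea behind UKDE.pdf.
    """
    edat = np.atleast_2d(edat)            # (N, K) evaluation points
    tdat = np.atleast_2d(tdat)            # (n, K) training data
    bw = np.asarray(bw, dtype=float)      # (K,) bandwidths
    # u has shape (N, n, K): scaled distances to every training point
    u = (edat[:, None, :] - tdat[None, :, :]) / bw
    kern = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return (kern.prod(axis=2) / bw.prod()).mean(axis=1)

rng = np.random.default_rng(1)
sample = rng.normal(size=(1000, 1))
# estimate near 0 should be close to the N(0,1) density there (~0.3989),
# slightly flattened by the kernel smoothing
dens = kde_pdf([[0.0]], sample, bw=[0.3])
```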

All components of UKDE (pdf, cdf, bw methods etc.) are tested in test_nonparametric2.py (with both real and generated data) and all results are identical (or very close) to those reported by the np package in R written by Jeff Racine (whose text and papers are the primary reference for the discussed methods). A few univariate graphical examples for the nonparametric estimation of popular distributions (f, Poisson, Pareto, Laplace etc.) are presented in ex_univar_kde.py.

The convergence time for the bandwidth selection methods has been tested and is consistent with the theoretical results.

II) Conditional Multivariate Density Estimation

The density of a set of variables Y=(y1,y2,...,yP), conditional on X=(x1,x2,...,xQ), is denoted f(Y|X) and equals the ratio of the joint density and the marginal density of X: f(Y,X)/f(X).
One difficulty that arises in conditional estimation is that the bandwidth selection procedure differs from the unconditional case, which prevents simple recycling of the code in UKDE.

Conditional density estimation is implemented with the class CKDE (acronym for Conditional Kernel Density Estimation)
CKDE(tydat, txdat, dep_type, indep_type, bw)

where:
tydat is the dependent variable(s)
txdat is the independent variable(s)
dep_type and indep_type are strings specifying the types of the dependent and independent variables
bw is either a vector of user-specified bandwidth values or a bandwidth selection method (as in UKDE)

CKDE also has methods for calculating the PDF and CDF of the density:
CKDE.pdf(eydat, exdat) and CKDE.cdf(eydat, exdat)
If left unspecified, eydat and exdat default to tydat and txdat.

Note: currently eydat needs to be N x (P+Q) and exdat is NxQ. To be fixed soon.
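The ratio form f(Y|X) = f(Y,X)/f(X) can be sketched directly from two product-kernel density estimates. As before, this is continuous variables only with made-up function names, not the CKDE code:

```python
import numpy as np

def gauss_prod_kde(edat, tdat, bw):
    """Gaussian product-kernel joint density (continuous variables only)."""
    edat, tdat = np.atleast_2d(edat), np.atleast_2d(tdat)
    bw = np.asarray(bw, dtype=float)
    u = (edat[:, None, :] - tdat[None, :, :]) / bw
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return (k.prod(axis=2) / bw.prod()).mean(axis=1)

def cond_pdf(eydat, exdat, tydat, txdat, bw_y, bw_x):
    """f(y|x) = f(y, x) / f(x): the ratio of the joint and marginal
    kernel density estimates described above (a sketch, not CKDE)."""
    joint = gauss_prod_kde(np.hstack([eydat, exdat]),
                           np.hstack([tydat, txdat]),
                           np.concatenate([bw_y, bw_x]))
    marg = gauss_prod_kde(exdat, txdat, bw_x)
    return joint / marg

rng = np.random.default_rng(2)
x = rng.normal(size=(2000, 1))
y = x + 0.3 * rng.normal(size=(2000, 1))    # y depends on x
# conditional on x = 0, density mass should sit near y = 0, not y = 2
near = cond_pdf([[0.0]], [[0.0]], y, x, [0.25], [0.25])
far = cond_pdf([[2.0]], [[0.0]], y, x, [0.25], [0.25])
```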

The functionality of CKDE, including the pdf, cdf and bandwidth estimates, has been cross-checked with the np package and produces identical or very similar results. All tests are in test_nonparametric2.py.
The cross-validation least squares bandwidth selection method is still a little faster in R so some optimization of the code for speed could be required.
The convergence time for the bandwidth selection methods has been tested and is consistent with the theoretical results.

III) Nonparametric Regression

The model is y = g(X) + e.
A regression is an estimation of E[y|X]. When the functional form of g(X) is not specified, it can be estimated with kernel methods.

Nonparametric regression can be implemented with the class Reg:
Reg(tydat, txdat, var_type, reg_type, bw)
where
tydat is the dependent (left-hand-side variable)
txdat is a list of independent variables
var_type is a string specifying the types of the independent variables (txdat)
reg_type is the type of estimator. It can take on two values: 
    'lc' - local constant (default)
    'll' - local linear estimator
bw: is either the user-specified vector of bandwidth values or a bandwidth selection method:
    'cv_ls' - cross validation least squares

Reg.mean(edat) returns the conditional mean E[y|X=edat]. If edat is not specified it defaults to the training data txdat.
Reg.r_squared() returns the R-squared of the model (the model fit)
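The local constant ('lc') estimator is the classic Nadaraya-Watson kernel-weighted average, and the nonparametric R-squared can be computed as the squared correlation between y and the fitted values. A minimal sketch (continuous X only, illustrative names, not the Reg class):

```python
import numpy as np

def local_constant_mean(edat, tydat, txdat, bw):
    """Nadaraya-Watson (local constant) estimate of E[y | X = edat]:

        g_hat(x) = sum_i K_h(x - X_i) y_i / sum_i K_h(x - X_i)

    A sketch of the idea behind Reg.mean with reg_type='lc'.
    """
    edat, txdat = np.atleast_2d(edat), np.atleast_2d(txdat)
    tydat = np.asarray(tydat, dtype=float).ravel()
    bw = np.asarray(bw, dtype=float)
    u = (edat[:, None, :] - txdat[None, :, :]) / bw
    w = np.exp(-0.5 * u**2).prod(axis=2)      # product Gaussian weights
    return w @ tydat / w.sum(axis=1)

def r_squared(tydat, fitted):
    """Squared correlation between y and the fitted values."""
    y = np.asarray(tydat, dtype=float).ravel()
    return np.corrcoef(y, fitted)[0, 1] ** 2

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=(400, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=400)   # g(x) = sin(x)
fit = local_constant_mean(x, y, x, bw=[0.2])       # in-sample fit
```

The local linear ('ll') estimator replaces the weighted average with a kernel-weighted linear fit at each evaluation point, which reduces bias at the boundaries.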

The bandwidth selection methods, R-Squared and conditional mean have been tested and the results are identical or very similar to those produced by the np package in R. 

IV) List of completed elements

Kernels:
  • Gaussian kernel (for continuous variables)
  • Aitchison-Aitken kernel (for unordered discrete variables)
  • Wang-van Ryzin kernel (for ordered discrete variables)
  • Epanechnikov kernel -- has been dropped from KernelFunctions.py (can be added again at some point if we decide to give the user the ability to specify kernels. However, the literature is unanimous that the kernel type is of small importance to the density estimation)
  • "Convolution kernels" for all three kernel types (used in conditional pdf and cdf estimation)
  • "CDF" kernels for all three kernel types (the integrals of the kernels), used for the fast estimation of conditional and unconditional cumulative distributions
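The two discrete kernels have simple closed forms. The sketch below follows the standard definitions from Racine's work; the statsmodels code may differ in detail:

```python
import numpy as np

def aitchison_aitken(lam, Xi, x, num_levels):
    """Aitchison-Aitken kernel for unordered discrete variables:
    1 - lam when the categories match, lam / (c - 1) otherwise,
    where c is the number of levels. Weights sum to 1 over categories."""
    Xi, x = np.asarray(Xi), np.asarray(x)
    return np.where(Xi == x, 1 - lam, lam / (num_levels - 1))

def wang_van_ryzin(lam, Xi, x):
    """Wang-van Ryzin kernel for ordered discrete variables:
    1 - lam at zero distance, 0.5 * (1 - lam) * lam**|d| otherwise,
    so the weight decays geometrically with the ordered distance d."""
    d = np.abs(np.asarray(Xi) - np.asarray(x))
    return np.where(d == 0, 1 - lam, 0.5 * (1 - lam) * lam ** d)
```

Both reduce to an indicator function as lam approaches 0, which is why a near-zero bandwidth on a discrete variable amounts to splitting the sample by its categories.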
Density Estimation
  • Normal reference rule of thumb bandwidth selection method (Scott's) for both UKDE and CKDE
  • Cross-validation least squares for both UKDE and CKDE for mixed data types
  • Cross-validation maximum likelihood for both UKDE and CKDE for mixed data types
  • Probability density function (PDF) for both UKDE and CKDE for mixed data types
  • Cumulative distribution function (CDF) for both UKDE and CKDE for mixed data types
Nonparametric Regression
  • Local constant estimator
  • Local linear estimator
  • Cross validation least squares bandwidth selection method for both estimators
  • Conditional mean for both estimators
  • R-Squared (fit for the model)
Other
  • Tests for all density and regression methods and bandwidth selection procedures
  • Graphical example with popular distributions in ex_univar_kde.py
  • Docstrings with references, LaTeX formulas and (some) examples
The plug-in bandwidth selection methods have been dropped because of their lack of robustness in a multivariate setting and inferiority to the data-driven methods like cv_ls and cv_ml.

V) Still to do:

  • Add marginal effects for regression
  • Add significance tests for regression
  • Add AIC bandwidth selection method to regression class
(I am currently working on these and they might be ready by the end of the week.)
