Wednesday, June 20, 2012

Estimating Probability Densities in Statsmodels

The nonparametric estimation in statsmodels relies on two main classes: UKDE and CKDE. Each class has attributes that store the probability density (pdf), the cumulative distribution function (cdf) and the bandwidth (bw). Currently the classes can handle mixed variable types (continuous and discrete data) and several bandwidth selection methods.

UKDE implements unconditional kernel density estimation. Suppose you would like to estimate the joint probability density of two variables, X and Y, where X is continuous and Y is an ordered discrete variable. To do this with statsmodels you simply create an instance of the class UKDE:

udens = UKDE(tdat=[X, Y], var_type='co', bw='cv_ls')

tdat is the training data (in this case a list of two arrays), var_type specifies the type of each variable in tdat ('c' for continuous, 'o' for ordered) and bw specifies the bandwidth selection method (in this case least squares cross-validation). Now that the density has been estimated, suppose you would like to evaluate it at a particular realization X = x and Y = y. To do this:

udens.pdf(edat=[x, y])

where edat is the evaluation data. x and y can also be arrays if the user wants to evaluate the density at several points at once.
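To make the idea of a mixed-data density estimate concrete, here is a minimal pure-Python sketch of a product-kernel estimator for one continuous and one ordered variable. It is not the UKDE implementation: the function names, the Gaussian kernel, the Wang-van Ryzin-style ordered kernel and the hand-picked bandwidths h and lam are all assumptions made for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here for the continuous variable."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def ordered_kernel(y, yi, lam):
    """Wang-van Ryzin-style kernel for an ordered discrete variable (0 <= lam < 1)."""
    if y == yi:
        return 1.0 - lam
    return 0.5 * (1.0 - lam) * lam ** abs(y - yi)

def mixed_pdf(x, y, xs, ys, h, lam):
    """Product-kernel estimate of the joint density f(x, y)."""
    total = 0.0
    for xi, yi in zip(xs, ys):
        total += gaussian_kernel((x - xi) / h) / h * ordered_kernel(y, yi, lam)
    return total / len(xs)

# Toy training data: X continuous, Y ordered (hypothetical values).
X = [0.1, 0.4, 0.5, 1.2, 1.9, 2.3]
Y = [0, 0, 1, 1, 2, 2]

# Bandwidths are fixed by hand here; a real estimator would select them from the data.
density = mixed_pdf(0.5, 1, X, Y, h=0.5, lam=0.2)
```

Each training point contributes the product of a continuous kernel and a discrete kernel, which is what lets a single estimator handle both variable types.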

An important part of nonparametric estimation is the calculation of the bandwidth. This is controlled by the input parameter bw. Currently the user can choose among three methods: the normal reference rule of thumb (bw='normal_reference'), maximum likelihood cross-validation (bw='cv_ml') and least squares cross-validation (bw='cv_ls'). Alternatively, the user can pass an array of values to be used as the bandwidth. The estimated bandwidth is stored in the bw attribute of the UKDE instance, so for the example above it is read as udens.bw.
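For intuition about the simplest of these methods, the sketch below computes a normal reference bandwidth for each continuous variable in pure Python. The constant 1.06, the exponent -1/(4 + q) with q the number of variables, and the use of the population standard deviation follow the common textbook form of the rule; whether UKDE uses exactly this formula is an assumption, and the function name is hypothetical.

```python
import math

def normal_reference_bw(columns):
    """Rule-of-thumb bandwidth, one value per variable.

    columns is a list of equally long lists, one per continuous variable.
    """
    n = len(columns[0])   # number of observations
    q = len(columns)      # number of variables
    bws = []
    for col in columns:
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        bws.append(1.06 * std * n ** (-1.0 / (4 + q)))
    return bws

# Two hypothetical continuous samples.
V_sample = [1.0, 2.0, 3.0, 4.0, 5.0]
W_sample = [0.2, 0.1, 0.4, 0.3, 0.5]
bandwidths = normal_reference_bw([V_sample, W_sample])
```

The rule needs only one pass over the data, which is why it is far cheaper than either cross-validation method, at the cost of assuming the data are roughly normal.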

Conditional kernel density estimation is implemented through the class CKDE. For example:

cdens = CKDE(tydat=[X, Y], txdat=[V, W], dep_type='co', indep_type='cc', bw='cv_ml')

This will estimate the conditional probability density P(X, Y | V, W), that is, the joint density of X and Y given V and W. tydat and txdat are the dependent and independent data, whose variable types are set by dep_type and indep_type respectively. In this case X is continuous and Y is ordered, while both independent variables V and W are continuous. The bandwidth selection method is maximum likelihood cross-validation, which runs faster than least squares cross-validation.

To evaluate the conditional pdf at particular values x, y, v, w, simply try:

cdens.pdf(eydat=[x, y], exdat=[v, w])
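A conditional density estimate can be thought of as a ratio of two kernel estimates, the joint density divided by the marginal density of the conditioning variables. The sketch below shows this for one continuous dependent variable and one continuous independent variable. It is an illustration under stated assumptions (Gaussian kernels, hand-picked bandwidths hy and hx, hypothetical function names), not the CKDE code.

```python
import math

def gauss(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def conditional_pdf(y, x, ys, xs, hy, hx):
    """Estimate f(y | x) as f(x, y) / f(x) using Gaussian product kernels."""
    joint = 0.0
    marginal = 0.0
    for yi, xi in zip(ys, xs):
        kx = gauss((x - xi) / hx) / hx
        joint += kx * gauss((y - yi) / hy) / hy
        marginal += kx
    # The 1/n factors of the two estimates cancel in the ratio.
    return joint / marginal

# Hypothetical paired observations.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0.1, 0.6, 0.9, 1.7, 2.1]
value = conditional_pdf(1.0, 1.0, ys, xs, hy=0.5, hx=0.5)
```

For a fixed x the result is a proper density in y: training points whose x value is close to the conditioning value receive more weight, and the weights sum to one.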

Sunday, June 10, 2012

A Few Words on Nonparametric Estimation

The main idea behind this Google Summer of Code project is to expand the nonparametric capabilities of statsmodels, a statistical library for Python. Nonparametric estimation requires almost no assumptions about the true distribution. We only require that the "true" distribution is smooth and differentiable.

The best-known example of nonparametric estimation is the simple histogram. By looking at the frequency of the data we infer characteristics of its probability distribution (normality, skewness, variance, etc.). The field of nonparametric econometrics takes this idea a little further by developing theoretically consistent ways of dealing with bandwidth selection (e.g. the number and width of the bins), incorporating multiple variables and estimating joint probability distributions of the type f(X_1, X_2, ..., X_n), working with mixed data types (continuous, ordered and unordered variables), estimating conditional densities, etc.
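The histogram itself can be written as a density estimator in a few lines, which makes the role of the bin width as a bandwidth explicit. This is a minimal sketch with hypothetical names and toy data, not code from statsmodels.

```python
import math

def histogram_density(data, bin_width, origin=0.0):
    """Return a function that estimates the density from bin counts."""
    n = len(data)
    counts = {}
    for v in data:
        k = math.floor((v - origin) / bin_width)
        counts[k] = counts.get(k, 0) + 1

    def pdf(x):
        k = math.floor((x - origin) / bin_width)
        # Share of the observations in the bin, divided by the bin width.
        return counts.get(k, 0) / (n * bin_width)

    return pdf

sample = [0.1, 0.2, 0.7, 1.4]
hist_pdf = histogram_density(sample, bin_width=0.5)
```

Narrow bins follow the sample closely but give a noisy estimate, while wide bins give a smoother estimate that can hide detail. Bandwidth selection methods address this same trade-off for kernel estimators.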

Of course nonparametric estimation is not a "silver bullet". Not having to specify a priori assumptions about the true state of the world is a luxury, and it comes at a price. A major drawback of nonparametric estimation is that consistent results require a great deal more data than the usual parametric methods. Furthermore, some of the bandwidth selection methods are computationally intensive and can take a significant amount of time. Working with many variables can also be challenging, as one quickly runs into the "curse of dimensionality": adding continuous variables rapidly increases the need for more data.

That being said, the growing computational capabilities of computers combined with the rapid accumulation of data from all walks of life will make the nonparametric methods an appealing inferential tool.