Wednesday, June 20, 2012

Estimating Probability Densities in Statsmodels

The nonparametric estimation in statsmodels relies on two main classes: UKDE and CKDE. Each class exposes the probability density (pdf), the cumulative distribution function (cdf) and the bandwidth (bw). Currently the classes can handle mixed variable types (continuous and discrete data) and several bandwidth selection methods.

UKDE implements unconditional kernel density estimation. Suppose you would like to estimate the joint probability density of two variables, X and Y, where X is continuous and Y is an ordered discrete variable. To do this with statsmodels you simply create an instance of the class UKDE:

udens = UKDE(tdat=[X, Y], var_type='co', bw='cv_ls')

tdat is the training data (in this case a list of two arrays), var_type specifies the type of each variable in tdat ('c' for continuous, 'o' for ordered) and bw specifies the bandwidth selection method (in this case least squares cross-validation). Now that the density has been estimated, suppose you would like to evaluate it at a particular realization X = x and Y = y. To do this:

udens.pdf(edat=[x, y])

where edat is the evaluation data. x and y can also be arrays if the user wants to evaluate the density at several points at once.
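To make the estimate concrete, here is a minimal pure-Python sketch of the kind of quantity the pdf call returns for one continuous and one ordered variable. It is not the statsmodels implementation. It assumes a product kernel with a Gaussian kernel for the continuous variable and a Wang-van Ryzin kernel for the ordered one, and the names joint_density, h and lam, along with the toy data, are made up for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here for the continuous variable."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def ordered_kernel(y, yi, lam):
    """Wang-van Ryzin kernel for an ordered discrete variable (assumed choice)."""
    if y == yi:
        return 1.0 - lam
    return 0.5 * (1.0 - lam) * lam ** abs(y - yi)

def joint_density(x, y, xs, ys, h, lam):
    """Product-kernel estimate of the joint density of (X, Y) at the point (x, y)."""
    total = 0.0
    for xi, yi in zip(xs, ys):
        total += gaussian_kernel((x - xi) / h) / h * ordered_kernel(y, yi, lam)
    return total / len(xs)

# Toy training data, invented for this example.
X = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7]
Y = [0, 0, 1, 1, 2, 2]

# Bandwidths are fixed by hand here; in practice they come from the bw method.
density_at_point = joint_density(0.5, 1, X, Y, h=0.3, lam=0.2)
```

The result is a density value, not a probability: it integrates to one over x and sums to one over y, but a single value can exceed one for small bandwidths.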

An important part of the nonparametric estimation is the calculation of the bandwidth. This is controlled by the input parameter bw. Currently the user can choose among three methods: the normal reference rule of thumb (bw = 'normal_reference'), maximum likelihood cross-validation (bw = 'cv_ml') and least squares cross-validation (bw = 'cv_ls'). Alternatively, the user can pass an array of values to be used as the bandwidth. The estimated bandwidth is stored in the bw attribute of the UKDE instance. To access it:

udens.bw
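The following sketch illustrates two of these selection ideas for a single continuous variable with a Gaussian kernel. It is an assumption-laden illustration rather than the statsmodels code: the rule of thumb uses the common textbook form 1.06 * sigma * n ** (-1 / (4 + q)), and the maximum likelihood cross-validation is done by a simple grid search where a real implementation would typically use a numerical optimizer. All function names and the sample data are hypothetical.

```python
import math
import statistics

def normal_reference_bw(sample, n_vars=1):
    """Rule-of-thumb bandwidth: 1.06 * sigma * n ** (-1 / (4 + n_vars))."""
    n = len(sample)
    return 1.06 * statistics.stdev(sample) * n ** (-1.0 / (4 + n_vars))

def loo_log_likelihood(sample, h):
    """Leave-one-out log likelihood of a Gaussian KDE with bandwidth h."""
    n = len(sample)
    total = 0.0
    for i, xi in enumerate(sample):
        dens = 0.0
        for j, xj in enumerate(sample):
            if j != i:  # leave the i-th observation out of its own estimate
                u = (xi - xj) / h
                dens += math.exp(-0.5 * u * u) / (h * math.sqrt(2.0 * math.pi))
        total += math.log(dens / (n - 1))
    return total

def cv_ml_bw(sample, candidates):
    """Pick the candidate bandwidth with the highest leave-one-out likelihood."""
    return max(candidates, key=lambda h: loo_log_likelihood(sample, h))

# Toy data, invented for this example.
sample = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7, 2.0, 2.2]
h_rule = normal_reference_bw(sample)
h_cv = cv_ml_bw(sample, [0.05 * k for k in range(1, 41)])
```

The rule of thumb costs a single pass over the data, while each cross-validation candidate needs a full pairwise loop, which is why the data-driven methods are noticeably slower.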

Conditional kernel density estimation is implemented through the class CKDE. For example:

cdens = CKDE(tydat=[X, Y], txdat=[V, W], dep_type='co', indep_type='cc', bw='cv_ml')

This will estimate the conditional probability density P(X, Y | V, W), that is, the joint density of X and Y given V and W. tydat and txdat are the dependent and independent data, whose variable types are set by dep_type and indep_type respectively. In this case X is continuous and Y is ordered, while both independent variables V and W are continuous. The bandwidth selection method is maximum likelihood cross-validation, which runs faster than least squares cross-validation.

To evaluate the conditional pdf at particular data x, y, v, w simply try:

cdens.pdf(eydat=[x, y], exdat=[v, w])
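A conditional density estimate of this kind can be understood as the ratio of a joint estimate to a marginal estimate of the conditioning variables. The sketch below shows that ratio for one continuous dependent and one continuous independent variable with Gaussian kernels. It is a simplified illustration, not the CKDE code; the function names, the bandwidths h_y and h_x, and the data are assumptions made up for this example.

```python
import math

def gauss(u, h):
    """Gaussian kernel evaluated at u and scaled by bandwidth h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def conditional_density(y, x, ys, xs, h_y, h_x):
    """Estimate f(y | x) as the joint estimate f(y, x) over the marginal f(x)."""
    joint = sum(gauss(y - yi, h_y) * gauss(x - xi, h_x) for yi, xi in zip(ys, xs))
    marginal = sum(gauss(x - xi, h_x) for xi in xs)
    # The 1/n factors of the two estimates cancel in the ratio.
    return joint / marginal

# Toy data, invented for this example.
indep = [0.2, 0.5, 0.9, 1.4, 1.8, 2.3]
dep = [1.1, 1.6, 2.0, 2.9, 3.7, 4.4]

cond = conditional_density(2.0, 1.0, dep, indep, h_y=0.5, h_x=0.4)
```

For a fixed value of the conditioning variable, the estimate integrates to one over the dependent variable, as a conditional density should.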
