Thursday, April 5, 2012

GSoC Application

As data collection improves and computational resources continue to
grow, nonparametric methods will become increasingly appealing to
researchers, even though they require more data and are more
computationally intensive. Several commercial packages, such as Matlab
and Mathematica, can currently handle some nonparametric estimation.
In addition, some open-source packages, such as R, have libraries that
can handle some nonparametric estimation [4]. The goal of this project
is to develop an open-source, Python-based alternative to these
packages within statsmodels (see [1] and [6]), which would make
statsmodels even more appealing to practitioners and academics and, I
hope, help make Python the primary choice for computational work.
The main focus of my summer work will be to expand the current
nonparametric capabilities of statsmodels [2,3] in three main
directions: develop fully data-driven bandwidth selection methods and
improve the existing “rule-of-thumb” methods; support conditional and
unconditional multivariate kernel density estimation; and implement
popular nonparametric models (see the textbook Nonparametric
Econometrics by Qi Li and Jeff Racine, 2007).
Project Schedule
Get familiar with the profiling tools for Python, and organize and
familiarize myself with the existing code in the sandbox [3]. Look for
tutorials on speed optimization that apply to the data-driven methods
for bandwidth selection.
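To give an idea of the kind of profiling I have in mind, here is a minimal sketch that uses the standard library's cProfile and pstats modules on a deliberately naive kernel density routine. The function naive_kde is a hypothetical stand-in, not code from the sandbox.

```python
import cProfile
import io
import pstats

import numpy as np


def naive_kde(x, grid, h):
    """Gaussian kernel density estimate with an explicit Python loop."""
    out = np.empty(len(grid))
    for i, g in enumerate(grid):
        u = (g - x) / h
        out[i] = np.exp(-0.5 * u**2).sum() / (len(x) * h * np.sqrt(2 * np.pi))
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=2000)
grid = np.linspace(-4, 4, 400)

# Collect timing data only for the call we care about.
profiler = cProfile.Profile()
profiler.enable()
naive_kde(x, grid, h=0.3)
profiler.disable()

# Write the ten most expensive entries (by cumulative time) to a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

The report shows where the time goes inside the loop, which is the kind of information I would use to decide what to vectorize first.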
Week 1 – 2 (May 21 – June 3)
Start work on the bandwidth selection methods. Add fully data-driven
methods to the current “rule-of-thumb” methods: likelihood
cross-validation, least-squares cross-validation, and the Hurvich,
Simonoff and Tsai (1998) bandwidth selection method. Introduce several
“plug-in” bandwidth selection procedures for some of the more popular
distributions. This should improve the current univariate kernel
density estimation procedures in statsmodels.
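To illustrate the difference between a rule-of-thumb bandwidth and a data-driven one, here is a minimal univariate sketch with a Gaussian kernel. It assumes Silverman's rule for the rule-of-thumb and a simple grid search for least-squares cross-validation. The function names and the grid are my own placeholders, not existing statsmodels API, and a real implementation would use a proper optimizer.

```python
import numpy as np


def silverman_bandwidth(x):
    """Rule-of-thumb bandwidth (Silverman's rule) for a Gaussian kernel."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    sigma = x.std(ddof=1)
    iqr = q75 - q25
    spread = min(sigma, iqr / 1.34) if iqr > 0 else sigma
    return 0.9 * spread * x.size ** (-0.2)


def lscv_score(h, x):
    """Least-squares cross-validation objective for a Gaussian kernel."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = (x[:, None] - x[None, :]) / h
    # Integral of the squared estimate: the convolution of two standard
    # normal kernels is the N(0, 2) density.
    conv = np.exp(-0.25 * d**2) / np.sqrt(4 * np.pi)
    integral_term = conv.sum() / (n**2 * h)
    # Leave-one-out density estimates evaluated at the observations.
    k = np.exp(-0.5 * d**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(k, 0.0)
    loo_term = 2.0 * k.sum() / (n * (n - 1) * h)
    return integral_term - loo_term


def lscv_bandwidth(x, grid=None):
    """Pick the bandwidth that minimizes the LSCV objective on a grid."""
    if grid is None:
        # Assumption: search around the rule-of-thumb value.
        grid = silverman_bandwidth(x) * np.linspace(0.2, 3.0, 57)
    scores = [lscv_score(h, x) for h in grid]
    return grid[int(np.argmin(scores))]


rng = np.random.default_rng(0)
sample = rng.normal(size=200)
h_rot = silverman_bandwidth(sample)
h_cv = lscv_bandwidth(sample)
print("rule of thumb:", h_rot, "least-squares CV:", h_cv)
```

The pairwise difference matrix makes each objective evaluation O(n^2), which is one reason speed matters for the data-driven methods.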
Week 2 – 4 (June 4 – June 17)
Begin work on two major classes: a multivariate unconditional density
estimator and a multivariate conditional density estimator. Adapt the
existing bandwidth selection procedures to handle multivariate density
estimation. Create two more classes that will estimate the cumulative
distribution functions in the conditional and unconditional cases.
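As a sketch of what the two density estimators compute, the code below uses a Gaussian product kernel with one bandwidth per variable and obtains the conditional density as the ratio of the joint estimate to the marginal estimate. It assumes continuous variables and fixed bandwidths, and the function names are hypothetical rather than a proposed class design.

```python
import numpy as np


def product_kernel(data, points, bw):
    """Gaussian product-kernel weights, shape (n_points, n_obs).

    data and points must be 2-D arrays with one column per variable.
    """
    data = np.asarray(data, dtype=float)
    points = np.asarray(points, dtype=float)
    bw = np.atleast_1d(np.asarray(bw, dtype=float))
    u = (points[:, None, :] - data[None, :, :]) / bw
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return k.prod(axis=2) / bw.prod()


def kde_unconditional(data, points, bw):
    """Estimate f(x) at each row of points."""
    return product_kernel(data, points, bw).mean(axis=1)


def kde_conditional(y, x, y_points, x_points, bw_y, bw_x):
    """Estimate f(y | x) as the joint estimate divided by the marginal of x."""
    joint_bw = np.concatenate([np.atleast_1d(bw_y), np.atleast_1d(bw_x)])
    joint = kde_unconditional(
        np.column_stack([y, x]), np.column_stack([y_points, x_points]), joint_bw
    )
    marginal = kde_unconditional(x, x_points, bw_x)
    return joint / marginal


rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
y = x + 0.5 * rng.normal(size=(300, 1))

# Conditional density of y given x = 0, evaluated on a grid of y values.
y_grid = np.linspace(-8, 8, 801)[:, None]
x_fixed = np.zeros_like(y_grid)
cond = kde_conditional(y, x, y_grid, x_fixed, bw_y=[0.4], bw_x=[0.4])
dy = y_grid[1, 0] - y_grid[0, 0]
print("integral of f(y | x=0):", cond.sum() * dy)
```

Because the same kernel weights in x appear in the numerator and the denominator, the conditional estimate integrates to one over y, which is a handy sanity check for the tests.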
Week 4 – 6 (June 18 – July 1)
Develop a class that fits nonparametric regression models of the type
y = g(x) + e, where x is multivariate, and implements the local
constant kernel estimator and the local linear kernel estimator
proposed by Stone (1977) and Cleveland (1979), with appropriate
significance tests and marginal effects.
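The two regression estimators can be sketched as follows, assuming a Gaussian product kernel and a fixed bandwidth vector. The local constant estimator is a kernel-weighted average of y, while the local linear estimator fits a weighted least-squares line around the evaluation point, so its slope coefficients give the marginal effects directly. The function names are placeholders, and significance tests are not covered here.

```python
import numpy as np


def kernel_weights(x, x0, bw):
    """Gaussian product-kernel weights of each observation relative to x0.

    Normalizing constants are dropped because they cancel in both estimators.
    """
    u = (np.asarray(x, dtype=float) - x0) / bw
    return np.exp(-0.5 * (u**2).sum(axis=1))


def local_constant(y, x, x0, bw):
    """Local constant estimate of g(x0): a kernel-weighted mean of y."""
    w = kernel_weights(x, x0, bw)
    return (w @ y) / w.sum()


def local_linear(y, x, x0, bw):
    """Local linear estimate of g(x0) and its gradient (marginal effects)."""
    w = kernel_weights(x, x0, bw)
    # Regress y on a constant and (x - x0) with kernel weights.
    z = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float) - x0])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(z * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]


rng = np.random.default_rng(0)
x = rng.uniform(size=(400, 2))
y = np.sin(2 * x[:, 0]) + x[:, 1] + 0.1 * rng.normal(size=400)

x0 = np.array([0.5, 0.5])
bw = np.array([0.15, 0.15])
print("local constant:", local_constant(y, x, x0, bw))
fit, effects = local_linear(y, x, x0, bw)
print("local linear:", fit, "marginal effects:", effects)
```

One useful property for testing is that the local linear estimator reproduces an exactly linear g(x) without error, whatever the bandwidth.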
Week 6 – 8 (July 2 – July 15)
Midterm (July 13). The work between week 1 and week 6 will form the
backbone of the models to come. Code the appropriate tests for the
conditional and unconditional density estimators and the nonparametric
regression. Cross-check results with the nonparametric package “np”
written for R and make sure all computational methods are working
properly [4].
Week 8 – 10 (July 16 – July 29)
Begin work on extending the model library. Write two classes that can
fit semiparametric Tobit models and semiparametric censored regression
models.
Week 10 – 12 (July 30 – August 12)
Explore the feasibility of including more advanced models such as
nonparametric simultaneous equation models and nonparametric panel
data models. Check if there is existing code that overlaps and start
the groundwork. These should overlap with the existing capabilities of
statsmodels [1]. Begin work on the documentation for the models and
start writing tests for the nonparametric models developed in the
second half of the summer. Compare results with other existing
packages.
Week 12 -  (August 13 - )
Polish the code and resolve any remaining issues. Ensure that the
documentation is complete and that any problems with it are fixed.
About me
I am currently completing my ninth semester in the economics PhD
program at American University in Washington, DC. I have completed all
my required course work and I am currently doing research for my
dissertation, which focuses on the application of nonparametric and
information-theoretic methods to continuous double auction financial
markets. A substantial part of my work is on applying kernel-based
methods to study the dynamics of asset returns conditional on
microstructure variables such as order flow and order book
characteristics. Some of my Python code on nonparametric estimation,
time series, microstructure and genetic algorithms is publicly
available [5]. I currently use R’s nonparametric package “np” through
rpy2 and rpy, but this has its drawbacks. I would like to help develop
the nonparametric capabilities of statsmodels and contribute to the
drive to make Python the primary choice for computational work among
academics and professionals alike.
Contact info
Name: George Panterov
Project Blog:
Project  Wiki: 

GSoC 2012 application

This is the first post as part of my GSoC 2012 application