Abstract
------------------
As data collection improves and computational resources grow, nonparametric methods will become increasingly appealing to researchers, even though they require more data and more computation than parametric alternatives. Several commercial packages, such as Matlab and Mathematica, currently handle some nonparametric estimation. In addition, some open-source packages such as R have libraries for nonparametric estimation. The goal of this project is to develop an open-source, Python-based alternative to these tools within statsmodels. This would make the package more appealing to practitioners and academics and would, I hope, help make Python the primary choice for computational work.
The main focus of my summer work will be to expand the current nonparametric capabilities of statsmodels [2,3] in three directions: develop fully data-driven bandwidth selection methods and improve the existing “rule-of-thumb” methods; support conditional and unconditional multivariate kernel density estimation; and implement popular nonparametric models (see the textbook Nonparametric Econometrics by Qi Li and Jeff Racine, 2007).
Project Schedule
--------------------------
Pre-GSoC Get familiar with the profiling tools for Python, and organize and become familiar with the existing code in the sandbox. Look for tutorials on optimizing the speed of the data-driven methods for bandwidth selection.
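As a small illustration of the profiling step, the sketch below times a deliberately naive O(n^2) kernel sum with the standard-library `cProfile` and `pstats` modules. The function `naive_kde_sum` and the toy data are hypothetical and are not part of statsmodels; they only stand in for the kind of inner loop that data-driven bandwidth selection evaluates many times.

```python
import cProfile
import io
import math
import pstats


def naive_kde_sum(data, h):
    """Sum Gaussian kernel weights over all pairs (deliberately unoptimized)."""
    total = 0.0
    for xi in data:
        for xj in data:
            u = (xi - xj) / h
            total += math.exp(-0.5 * u * u)
    return total


# Hypothetical toy sample; any list of floats would do.
data = [0.01 * i for i in range(300)]

profiler = cProfile.Profile()
profiler.enable()
result = naive_kde_sum(data, 0.5)
profiler.disable()

# Collect the report in a string buffer, sorted by cumulative time.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists the calls that dominate the run time, which indicates where vectorization or compiled code is likely to pay off.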
Week 1 – 2 (May 21 – June 3) Start work on the bandwidth selection methods. Add fully data-driven methods to the current “rule-of-thumb” methods: likelihood cross-validation, least-squares cross-validation, and the Hurvich, Simonoff and Tsai (1998) bandwidth selection method. Introduce several “plug-in” bandwidth selection procedures for some of the more popular distributions. This should improve the current univariate kernel density estimation procedures in statsmodels.
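To make the contrast between rule-of-thumb and data-driven selection concrete, here is a minimal sketch of leave-one-out likelihood cross-validation for a univariate Gaussian kernel, next to the normal-reference rule of thumb. It is an illustration under simplifying assumptions (pure Python, a coarse grid search rather than a numerical optimizer), and the function names are hypothetical rather than the statsmodels API.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def rule_of_thumb_bandwidth(data):
    """Normal-reference rule: h = 1.06 * sigma * n^(-1/5)."""
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return 1.06 * sigma * n ** (-0.2)


def loo_log_likelihood(data, h):
    """Sum of log leave-one-out density estimates at each observation."""
    n = len(data)
    total = 0.0
    for i, xi in enumerate(data):
        dens = 0.0
        for j, xj in enumerate(data):
            if j != i:
                dens += gaussian_kernel((xi - xj) / h)
        dens /= (n - 1) * h
        if dens <= 0.0:
            # An isolated point with a tiny bandwidth underflows to zero.
            return float("-inf")
        total += math.log(dens)
    return total


def ml_cv_bandwidth(data, grid):
    """Pick the grid value that maximizes the leave-one-out log-likelihood."""
    return max(grid, key=lambda h: loo_log_likelihood(data, h))


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(80)]

h_rot = rule_of_thumb_bandwidth(data)
grid = [h_rot * factor for factor in (0.25, 0.5, 0.75, 1.0, 1.5, 2.0)]
h_cv = ml_cv_bandwidth(data, grid)
print("rule of thumb:", h_rot, "likelihood CV:", h_cv)
```

The grid is centered on the rule-of-thumb value only for convenience. Each evaluation of the criterion costs O(n^2) kernel calls, which is why speed matters for these methods.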
Week 2 – 4 (June 4 – June 17) Begin work on two major classes: a multivariate unconditional density estimator and a multivariate conditional density estimator. Adapt the existing bandwidth selection procedures to handle multivariate density estimation. Create two more classes that estimate the cumulative distributions in the conditional and unconditional cases.
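The two estimators can be sketched with a product Gaussian kernel, where the conditional density is the ratio of the joint estimate to the marginal estimate of the conditioning variables. This is a minimal sketch for continuous variables with fixed, user-supplied bandwidths; the function names are hypothetical and do not reflect the eventual class design.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the kernel in every dimension."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def product_kernel_density(point, data, bandwidths):
    """Unconditional density estimate at `point` using a product kernel.

    `data` is a list of observations, each a sequence with one value per
    variable; `bandwidths` holds one bandwidth per variable.
    """
    n = len(data)
    total = 0.0
    for row in data:
        weight = 1.0
        for value, obs, h in zip(point, row, bandwidths):
            weight *= gaussian_kernel((value - obs) / h) / h
        total += weight
    return total / n


def conditional_density(y, x, y_data, x_data, y_bw, x_bw):
    """Estimate f(y | x) as the joint estimate divided by the marginal of x.

    The marginal can be numerically zero far from the data; a real
    implementation would need to guard against that case.
    """
    joint_data = [list(yr) + list(xr) for yr, xr in zip(y_data, x_data)]
    joint = product_kernel_density(
        list(y) + list(x), joint_data, list(y_bw) + list(x_bw)
    )
    marginal = product_kernel_density(x, x_data, x_bw)
    return joint / marginal


# Hypothetical toy data: one dependent variable, two conditioning variables.
y_data = [[0.1], [0.4], [0.35], [0.8]]
x_data = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [2.0, 1.0]]
print(conditional_density([0.3], [1.2, 2.0], y_data, x_data, [0.3], [0.5, 0.5]))
```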
Week 4 – 6 (June 18 – July 1) Develop a class that fits nonparametric regression models of the type y = g(x) + e, where x is multivariate. The class will implement the local constant kernel estimator and the local linear kernel estimator proposed by Stone (1977) and Cleveland (1979), with appropriate significance tests and marginal effects.
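The two estimators differ only in the local model fitted around each evaluation point: a kernel-weighted average for the local constant (Nadaraya-Watson) estimator, and a kernel-weighted straight-line fit for the local linear estimator, whose slope doubles as an estimate of the marginal effect. The sketch below assumes a scalar regressor and a fixed bandwidth to keep the algebra short; the planned class handles multivariate x, and the function names here are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def local_constant(x0, xs, ys, h):
    """Local constant estimate of g(x0): a kernel-weighted average of y."""
    weights = [gaussian_kernel((x - x0) / h) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


def local_linear(x0, xs, ys, h):
    """Local linear estimate of g(x0) and its slope.

    Solves the kernel-weighted least-squares fit of y on (1, x - x0) in
    closed form. Returns (fitted value at x0, estimated marginal effect).
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = gaussian_kernel(d / h)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    det = s0 * s2 - s1 * s1
    intercept = (s2 * t0 - s1 * t1) / det
    slope = (s0 * t1 - s1 * t0) / det
    return intercept, slope


# Hypothetical noiseless data on a line, y = 2 + 3x.
xs = [i / 10.0 for i in range(21)]
ys = [2.0 + 3.0 * x for x in xs]

fit, effect = local_linear(0.3, xs, ys, 0.5)
print("local linear fit and marginal effect at x = 0.3:", fit, effect)
print("local constant fit at x = 0.3:", local_constant(0.3, xs, ys, 0.5))
```

On exactly linear data the local linear fit recovers the line, while the local constant fit is pulled toward the interior of the sample near the boundary. That boundary behavior is one common reason to offer both estimators.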
Week 6 – 8 (July 2 – July 15) Midterm (July 13). The work between week 1 and week 6 will form the backbone of the models to come. Write the tests for the conditional and unconditional density estimators and for the nonparametric regression. Cross-check results against the nonparametric package “np” written for R and make sure all computational methods are working properly.
Week 8 – 10 (July 16 – July 29) Begin work on extending the model library. Write two classes that can fit semiparametric Tobit models and semiparametric censored regression models.
Week 10 – 12 (July 30 – August 12) Explore the feasibility of including more advanced models such as nonparametric simultaneous equation models and nonparametric panel data models. Check whether there is existing code that overlaps and start the groundwork. These models should overlap with the existing capabilities of statsmodels. Begin work on the documentation for the models and start writing tests for the nonparametric models developed in the second half of the summer. Compare results with other existing packages.
Week 12 - (August 13 - ) Resolve any remaining issues with the code and polish it. Make sure that the documentation is complete and that any issues with it are resolved.
About me
-------------------
I am currently completing my ninth semester in the economics PhD program at American University in Washington, DC. I have completed all my required coursework and am now doing research for my dissertation, which focuses on applying nonparametric and information-theoretic methods to continuous double auction financial markets. A substantial part of my work applies kernel-based methods to study the dynamics of asset returns conditional on microstructure variables such as order flow and order book characteristics. Some of my Python code on nonparametric estimation, time series, microstructure and genetic algorithms is publicly available. I currently use R’s nonparametric package “np” through rpy2 and rpy, but this has its drawbacks. I would like to help develop the nonparametric capabilities of statsmodels and contribute to the drive to make Python the primary choice for computational work among academics and professionals alike.