I’m just about to submit some of our newest research involving Bayesian regression trees!  It’s a really exciting project on MCMC samplers that has enabled us to use this flexible and scalable regression approach in challenging uncertainty quantification problems.  Update: here’s a link to the draft.  Abstract follows.

Bayesian regression trees are flexible non-parametric models that are well suited to many modern statistical regression problems.  Many such tree models have been proposed, from the simple single-tree model to more complex tree ensembles.  Their non-parametric formulation allows for effective and efficient modeling of datasets exhibiting complex non-linear relationships between the model predictors and observations.  However, the mixing behavior of the MCMC sampler is sometimes poor.  This is because the proposals in the sampler are typically local alterations of the tree structure, such as the birth/death of leaf nodes, which do not allow for efficient traversal of the model space.  This poor mixing can lead to inferential problems, such as under-representing uncertainty.  In this paper, we develop novel proposal mechanisms for efficient sampling.  The first is a rule perturbation proposal, while the second we call tree rotation.  The perturbation proposal can be seen as an efficient variation of the change proposal found in existing literature.  The novel tree rotation proposal is simple to implement as it only requires local changes to the regression tree structure, yet it efficiently traverses disparate regions of the model space along contours of equal probability.  When combined with the classical birth/death proposal, the resulting MCMC sampler exhibits good acceptance rates and properly represents model uncertainty in the posterior samples.  We implement this sampling algorithm in the Bayesian Additive Regression Tree (BART) model and demonstrate its effectiveness on a prediction problem from computer experiments and a test function where structural tree variability is needed to fully explore the posterior.
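To give a flavor of how a sampler mixes several move types, here is a minimal, hypothetical sketch of a Metropolis-Hastings step that picks among birth, death, and perturb moves on a toy state (a leaf count and a single cut value standing in for a real tree).  All function names and the toy posterior are illustrative assumptions, not the paper's actual implementation, and the moves are treated as symmetric for simplicity.

```python
import math
import random

def log_post(s):
    # Toy log-posterior: geometric-style penalty on leaf count,
    # normal prior on a single cut value. Purely illustrative.
    return -0.7 * s["leaves"] - 0.5 * s["cut"] ** 2

# Local moves, analogous in spirit to birth/death/perturb on a tree.
def birth(s, rng):
    return {"leaves": s["leaves"] + 1, "cut": s["cut"]}

def death(s, rng):
    return {"leaves": max(1, s["leaves"] - 1), "cut": s["cut"]}

def perturb(s, rng):
    return {"leaves": s["leaves"], "cut": s["cut"] + rng.gauss(0.0, 0.5)}

PROPOSALS = [birth, death, perturb]

def mh_step(s, rng):
    # Pick a move at random, accept with the Metropolis-Hastings ratio
    # (proposal ratios taken as 1 in this toy).
    cand = rng.choice(PROPOSALS)(s, rng)
    if math.log(rng.random()) < log_post(cand) - log_post(s):
        return cand, True
    return s, False

rng = random.Random(1)
state, accepts = {"leaves": 1, "cut": 0.0}, 0
for _ in range(5000):
    state, accepted = mh_step(state, rng)
    accepts += accepted
```

A rotation move would slot into `PROPOSALS` in exactly the same way; its appeal in the paper is that it moves between structurally different trees without leaving regions of comparable posterior probability.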

Our paper on Parallel Bayesian Additive Regression Trees has been accepted for publication in the Journal of Computational and Graphical Statistics!  The arXiv preprint is available at http://arxiv.org/abs/1309.1906.  Abstract follows.

Bayesian Additive Regression Trees (BART) is a Bayesian approach to flexible non-linear regression which has been shown to be competitive with the best modern predictive methods such as those based on bagging and boosting. BART offers some advantages. For example, the stochastic search Markov Chain Monte Carlo (MCMC) algorithm can provide a more complete search of the model space and variation across MCMC draws can capture the level of uncertainty in the usual Bayesian way. The BART prior is robust in that reasonable results are typically obtained with a default prior specification. However, the publicly available implementation of the BART algorithm in the R package BayesTree is not fast enough to be considered interactive with over a thousand observations, and is unlikely to even run with 50,000 to 100,000 observations. In this paper we show how the BART algorithm may be modified and then computed using single program, multiple data (SPMD) parallel computation implemented using the Message Passing Interface (MPI) library. The approach scales nearly linearly in the number of processor cores, enabling the practitioner to perform statistical inference on massive datasets. Our approach can also handle datasets too massive to fit on any single data repository.
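The core SPMD idea can be sketched in a few lines: each rank keeps a shard of the data, computes local sufficient statistics, and an allreduce combines them so every rank sees the global values.  A real implementation would call MPI_Allreduce; the plain-Python stand-in below (with hypothetical helper names) just simulates four ranks in one process.

```python
def local_stats(shard):
    # Each rank computes sufficient statistics on its own shard;
    # here, the sum and count needed for a leaf mean.
    return (sum(shard), len(shard))

def allreduce(per_rank_stats):
    # Element-wise sum across ranks, as MPI_Allreduce with MPI_SUM would do.
    total = sum(s for s, _ in per_rank_stats)
    count = sum(n for _, n in per_rank_stats)
    return total, count

data = list(range(100))                     # full dataset
shards = [data[r::4] for r in range(4)]     # 4 "ranks", strided partition
total, count = allreduce([local_stats(s) for s in shards])
global_mean = total / count                 # identical on every rank
```

Because only small sufficient statistics cross the network (never the raw observations), the communication cost stays flat as the data grow, which is what gives the near-linear scaling in the number of cores and lets the data live on separate repositories.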

Our paper on Computer Model Calibration using the Ensemble Kalman Filter has been accepted for publication in Technometrics and will appear in the Conference on Data Analysis (CoDA) issue!  The arXiv preprint version is at http://arxiv.org/abs/1204.3547.  Abstract follows.

The ensemble Kalman filter (EnKF) (Evensen, 2009) has proven effective in quantifying uncertainty in a number of challenging dynamic state estimation, or data assimilation, problems such as weather forecasting and ocean modeling. In these problems a high-dimensional state parameter is successively updated based on recurring physical observations, with the aid of a computationally demanding forward model that propagates the state from one time step to the next. More recently, the EnKF has proven effective in history matching in the petroleum engineering community (Evensen, 2009; Oliver and Chen, 2010). Such applications typically involve estimating large numbers of parameters, describing an oil reservoir, using data from production history that accumulate over time. Such history matching problems are especially challenging examples of computer model calibration since they involve a large number of model parameters as well as a computationally demanding forward model. More generally, computer model calibration combines physical observations with a computational model – a computer model – to estimate unknown parameters in the computer model. This paper explores how the EnKF can be used in computer model calibration problems, comparing it to other more common approaches, and considering applications in climate and cosmology.
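For readers unfamiliar with the EnKF update, here is a minimal scalar sketch of one stochastic analysis step: each ensemble member is nudged toward a perturbed copy of the observation, with a Kalman gain built from the ensemble's own variance.  This is a textbook toy (scalar state, direct observation), not the calibration scheme of the paper, and the function name is an illustrative assumption.

```python
import random
import statistics as st

def enkf_update(ensemble, y, obs_var, rng):
    # Forecast variance estimated from the ensemble itself.
    P = st.variance(ensemble)
    # Kalman gain for a directly observed scalar state.
    K = P / (P + obs_var)
    # Stochastic EnKF: each member sees its own perturbed observation.
    return [x + K * (y + rng.gauss(0.0, obs_var ** 0.5) - x)
            for x in ensemble]

rng = random.Random(0)
prior = [rng.gauss(0.0, 2.0) for _ in range(200)]       # forecast ensemble
post = enkf_update(prior, y=3.0, obs_var=0.25, rng=rng)  # analysis ensemble
```

The appeal for calibration is visible even here: the update needs only forward-model runs at the ensemble members, never derivatives of the forward model, and the posterior spread shrinks in proportion to how informative the observation is.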