Regression trees with heteroscedastic data

Regression trees are popular tools for predicting a response from one or more explanatory variables. However, their basic construction assumes that the data are "homoscedastic", meaning that the potential for response variability is the same for all sampled units. This assumption often fails in actual data sets, a condition called "heteroscedasticity". The potential effects of heteroscedasticity on the prediction performance of regression trees are largely unknown. Research on this question is complicated by the fact that numerous mathematical models can represent how heteroscedasticity arises. We have previously used one very simple model and one explanatory variable to explore some of the fundamental changes that can occur in regression-tree predictions. We plan to extend this work to alternative models that may represent the nature of heteroscedasticity more realistically. We also plan to use multiple explanatory variables, so that the effects of heteroscedasticity on the tree's variable selection can be assessed. Specifically, we will (1) identify potential models to represent heteroscedasticity realistically, (2) simulate data from these models, (3) fit regression trees to the simulated data, and (4) measure the performance of the trees in various ways. Among these performance measures will be assessments of (a) how accurately the tree makes splits, (b) how close its predictions are to the true values (both absolutely and relative to the local variability), and (c) how frequently the tree chooses to split on variables that are important.
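
As a rough illustration of steps (2) to (4), the sketch below makes several assumptions that are not taken from the proposal: a single explanatory variable, a step-function mean with one true split at x = 0.5, and a standard deviation that grows linearly in x. A one-split tree (a "stump") fitted by least squares stands in for a full regression tree, and the only performance measure shown is the distance between the fitted and true split points, in the spirit of measure (a). The names `simulate`, `fit_stump`, and `mean_split_error` are hypothetical.

```python
import random
import statistics


def simulate(n, slope, rng, split=0.5, base_sd=0.2):
    """Draw n (x, y) pairs with a step-function mean and SD linear in x.

    Assumed model (illustrative only): y = mean(x) + sd(x) * eps, where
    mean(x) is 0 below `split` and 1 at or above it, and
    sd(x) = base_sd * (1 + slope * x). slope = 0 gives homoscedastic data.
    """
    data = []
    for _ in range(n):
        x = rng.random()
        mean = 0.0 if x < split else 1.0
        sd = base_sd * (1.0 + slope * x)
        data.append((x, rng.gauss(mean, sd)))
    return data


def fit_stump(data):
    """Return (threshold, left_mean, right_mean) minimizing the total SSE."""
    pts = sorted(data)
    n = len(pts)
    ys = [y for _, y in pts]
    total, total_sq = sum(ys), sum(y * y for y in ys)
    left_sum = left_sq = 0.0
    best = None
    for i in range(1, n):
        # Move observation i-1 into the left node.
        left_sum += ys[i - 1]
        left_sq += ys[i - 1] ** 2
        if pts[i - 1][0] == pts[i][0]:
            continue  # cannot split between tied x values
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Within-node sum of squares: sum(y^2) - (sum y)^2 / node size.
        sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if best is None or sse < best[0]:
            threshold = (pts[i - 1][0] + pts[i][0]) / 2
            best = (sse, threshold, left_sum / i, right_sum / (n - i))
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]


def mean_split_error(slope, reps=200, n=200, seed=1, split=0.5):
    """Average |fitted threshold - true split| over repeated simulations."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        threshold, _, _ = fit_stump(simulate(n, slope, rng, split=split))
        errors.append(abs(threshold - split))
    return statistics.mean(errors)
```

Under these assumptions, comparing `mean_split_error(0.0)` with `mean_split_error(3.0)` contrasts a homoscedastic setting with a heteroscedastic one. Measures (b) and (c) would need the fitted node means and multiple explanatory variables, which this sketch leaves out.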