Haiyang (Jason) Jiang

Understanding and Estimating Predictive Performance of Statistical Learning Methods Based on Data Properties

Many Statistical Learning (SL) regression methods have been developed over roughly the last two decades, but no single method has been found to be best across all data sets. It would be useful to have guidance on when each method can be expected to provide more accurate or more precise predictions than its competitors. We hypothesize that certain measurable features of a data set influence how accurately a given method can predict. This thesis explores whether such measurable characteristics can be used to estimate the prediction performance of different SL regression methods. We demonstrate the approach on an existing collection of 42 benchmark data sets. On each data set, we measure a variety of properties that might help distinguish regression methods likely to perform well from those likely to perform poorly. Using repeated cross-validation, we measure the actual relative prediction performance of 12 well-known regression methods, including both classical linear techniques and more modern flexible approaches. Finally, we combine the performance measures and the data set properties in a multivariate regression model to identify which properties appear most important and to estimate the expected prediction performance of each method.
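The workflow summarized above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure used in the thesis: the three stand-in methods (a mean-only baseline, ordinary least squares, and 1-nearest-neighbour), the two data set properties, the ratio-to-best definition of relative performance, and the synthetic data sets are all hypothetical choices made for the example.

```python
import numpy as np

def fit_mean(X, y):
    """Baseline: ignore the predictors and predict the training mean."""
    m = y.mean()
    return lambda Xn: np.full(len(Xn), m)

def fit_ols(X, y):
    """Ordinary least squares with an intercept (a classical linear method)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ beta

def fit_1nn(X, y):
    """1-nearest-neighbour regression (a simple flexible method)."""
    def predict(Xn):
        d = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return y[d.argmin(axis=1)]
    return predict

# Hypothetical stand-ins for the 12 methods mentioned in the abstract.
METHODS = {"mean": fit_mean, "ols": fit_ols, "1nn": fit_1nn}

def repeated_cv_rmse(X, y, fit, folds=5, repeats=3, seed=0):
    """Average RMSE over `repeats` random K-fold splits.
    The fixed seed gives every method the same splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(repeats):
        perm = rng.permutation(n)
        sq_err = np.empty(n)
        for test_idx in np.array_split(perm, folds):
            train_idx = np.setdiff1d(perm, test_idx)
            predict = fit(X[train_idx], y[train_idx])
            sq_err[test_idx] = (y[test_idx] - predict(X[test_idx])) ** 2
        scores.append(np.sqrt(sq_err.mean()))
    return float(np.mean(scores))

def relative_performance(X, y):
    """RMSE of each method divided by the best RMSE (1.0 marks the best)."""
    rmse = {name: repeated_cv_rmse(X, y, fit) for name, fit in METHODS.items()}
    best = min(rmse.values())
    return {name: value / best for name, value in rmse.items()}

def data_properties(X, y):
    """Two illustrative (assumed) properties of a data set."""
    n, p = X.shape
    corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)]
    return {"n_over_p": n / p, "max_abs_corr": float(max(corr))}

def make_dataset(n, noise, seed):
    """Synthetic linear data standing in for a benchmark data set."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=noise, size=n)
    return X, y

settings = [(60, 0.5), (60, 2.0), (120, 0.5), (120, 2.0), (200, 0.5), (200, 2.0)]
datasets = [make_dataset(n, noise, seed=i) for i, (n, noise) in enumerate(settings)]

props_list = [data_properties(X, y) for X, y in datasets]
rel_list = [relative_performance(X, y) for X, y in datasets]

# Multivariate regression: one response column per method, one row per data set.
P = np.array([[1.0, d["n_over_p"], d["max_abs_corr"]] for d in props_list])
R = np.array([[rel[m] for m in METHODS] for rel in rel_list])
coef, *_ = np.linalg.lstsq(P, R, rcond=None)  # rows: intercept + properties
```

Each column of `coef` then describes how one method's relative performance varies with the properties, and `P @ coef` gives the estimated relative performance for each data set. With only six synthetic data sets the coefficients carry no real meaning; the sketch is meant to show the shape of the computation only.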