Ran Wang

Understanding Multicollinearity in Bayesian Model Averaging with BIC Approximation.

Bayesian model averaging (BMA) is a widely used method for model and variable selection. In particular, BMA with BIC approximation is a frequentist view of model averaging that avoids much of the computation required by the full Bayesian approach. However, BMA with BIC approximation may give misleading results in linear regression models when multicollinearity is present. In this article, we explore how the performance of BMA with BIC approximation depends on the true regression parameters and on the correlations among the explanatory variables. Specifically, we derive approximate formulae, in the context of a known regression model, to predict BMA behaviour in three respects: model selection, variable importance, and coefficient estimation. We use simulations to verify the accuracy of the approximations. Through mathematical analysis, we demonstrate that BMA may fail to identify the correct model as the highest-probability model if the coefficient and correlation parameters combine to minimize the residual sum of squares of a wrong model. We find that if the regression parameters of the important variables are relatively large, BMA is generally successful in model and variable selection. On the other hand, if the regression parameters of the important variables are relatively small, BMA can be unreliable in identifying the best model or the important variables, especially when the correlation matrix of the full model is close to singular.

The simulation studies suggest that our formulae are over-optimistic in predicting the posterior probabilities of the true models and important variables. Nevertheless, the formulae still provide insight into the effect of collinearity on BMA.

Keywords: All subsets regression, Simulation, Model selection, Variable importance, Expected residual sum of squares.
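
As an illustration of the quantities discussed in the abstract, the following is a minimal sketch of all-subsets BMA with the BIC approximation for a linear regression. It is not the article's own code. It assumes equal prior probabilities for all models, an intercept in every model, and the standard approximation that the posterior model probability is proportional to exp(-BIC/2). The function name bma_bic and the simulated data (two predictors with correlation 0.9, of which only the first has a non-zero coefficient) are hypothetical choices made for this example.

```python
import itertools
import numpy as np


def bma_bic(X, y):
    """All-subsets BMA with BIC-approximated posterior model probabilities.

    Assumes equal prior model probabilities and an intercept in every model.
    Returns the list of models (tuples of column indices), their approximate
    posterior probabilities, and the posterior inclusion probability of each
    explanatory variable.
    """
    n, p = X.shape
    models, bics = [], []
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            # Design matrix: intercept plus the chosen columns.
            Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss = float(np.sum((y - Z @ beta) ** 2))
            # Gaussian-model BIC, up to a constant shared by all models.
            bics.append(n * np.log(rss / n) + Z.shape[1] * np.log(n))
            models.append(subset)
    bics = np.array(bics)
    # Subtract the minimum BIC before exponentiating for numerical stability.
    weights = np.exp(-0.5 * (bics - bics.min()))
    weights /= weights.sum()
    # Variable importance: total weight of the models containing each variable.
    inclusion = np.array(
        [sum(w for w, m in zip(weights, models) if j in m) for j in range(p)]
    )
    return models, weights, inclusion


# Hypothetical simulation: x1 and x2 are strongly correlated, only x1 matters.
rng = np.random.default_rng(0)
n, rho = 200, 0.9
x1 = rng.standard_normal(n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.standard_normal(n)

models, weights, inclusion = bma_bic(X, y)
for m, w in zip(models, weights):
    print(m, round(float(w), 4))
print("inclusion probabilities:", np.round(inclusion, 4))
```

Shrinking the coefficient of x1 or pushing rho closer to 1 in this sketch is one way to explore the regime in which the abstract warns that the highest-probability model need not be the true one.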