The impact of variable omission on variable importance measures of CART, random forest, and boosting algorithms

When researchers use statistical models to gain insights into various phenomena, they tacitly assume that most (if not all) of the predictor variables that are important for the outcome are included in the analysis. In practice, however, this may not always be possible, whether because some important variables could not be measured or because the researcher was not aware of all such predictors. Prior research has shown that when important variables are omitted from linear or nonlinear regression models, the model coefficients can be biased, with greater bias associated with larger correlations between the omitted and retained variables. However, very little work has examined how such omissions affect the performance of variable importance measures used with popular machine learning algorithms. The purpose of this simulation study was therefore to address this gap in the literature and to provide insights into the impact of such omissions on variable importance measures for classification and regression trees (CART), random forests, and boosting algorithms. Results showed that when an important variable is omitted from an analysis, other predictors that are correlated with it and/or involved in an interaction with it will have inflated variable importance measures. An empirical example and implications of these results are discussed.
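
The prior finding for linear regression mentioned above can be illustrated with a small simulation. The following is a minimal sketch only, not the simulation design used in this study: the function name `simulate_slope`, the sample size, the coefficient values, and the correlation levels are all assumptions chosen for illustration. It generates an outcome from two correlated predictors, fits a simple regression that omits the second predictor, and returns the estimated slope on the retained predictor. Under these assumptions the standard omitted-variable result suggests that the estimate should drift away from the true coefficient as the correlation between the omitted and retained predictors grows.

```python
import random


def simulate_slope(rho, n=20000, beta1=1.0, beta2=1.0, seed=0):
    """Fit y ~ x1 by least squares when the true model also contains x2.

    x1 and x2 are standard normal with correlation rho (illustrative
    assumption); x2 is omitted from the fit. Returns the estimated
    slope on x1.
    """
    rng = random.Random(seed)
    x1s, ys = [], []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        # Construct x2 so that corr(x1, x2) = rho and var(x2) = 1.
        x2 = rho * x1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        # True data-generating model uses both predictors plus noise.
        y = beta1 * x1 + beta2 * x2 + rng.gauss(0.0, 1.0)
        x1s.append(x1)
        ys.append(y)

    # Closed-form simple regression slope: cov(x1, y) / var(x1).
    mean_x = sum(x1s) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x1s, ys))
    sxx = sum((x - mean_x) ** 2 for x in x1s)
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical correlation levels; the true coefficient on x1 is 1.0.
    for rho in (0.0, 0.3, 0.6, 0.9):
        slope = simulate_slope(rho)
        print(f"rho={rho:.1f}  estimated slope on x1 (x2 omitted) = {slope:.3f}")
```

With uncorrelated predictors the estimated slope should stay close to the true value, whereas with correlated predictors the retained variable absorbs part of the omitted variable's effect. The tree-based importance measures examined in this study are a different quantity, so this sketch conveys only the analogous intuition for regression coefficients.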
