Heather Gunn, University of California, Los Angeles

Fitting a LASSO to multiply imputed data: A missing discussion

Spotlight Speaker

Behavioral science researchers often use standard linear regression to identify relevant predictors of an outcome of interest. Testing all predictors simultaneously can lead to overfitting and inflation of standard errors. Regularization methods like the LASSO reduce the risk of overfitting, increase model interpretability, and improve prediction in future samples; however, handling missing data when using regularization-based variable selection methods is complicated. Typically, researchers use listwise deletion or ad hoc single-imputation strategies like mean imputation to handle missing data when fitting the LASSO, which can lead to loss of precision, substantial bias, and a reduction in predictive ability. In this talk, we describe three approaches for fitting a LASSO when using multiple imputation to handle missing data: a separate approach, a stacked approach, and the MI-LASSO. In the separate approach, a LASSO is fit to each imputed data set, which can result in a different set of selected variables in each imputed data set. In the stacked approach, a single LASSO is fit to the stacked set of imputed data sets. Finally, the MI-LASSO uses the group LASSO to fit a LASSO to each imputed data set simultaneously, resulting in consistent variable selection. We illustrate how to implement these approaches in practice using an applied example, highlighting the different decision points needed for each approach. We end with a discussion of the implications of using each approach and the additional research needed to solidify recommendations for best practices.
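
As a rough illustration of the separate and stacked approaches (not the implementation or the applied example from the talk), the following Python sketch fits a LASSO to toy data by coordinate descent. Everything in it is an assumption made for illustration: the simulated data, the penalty value `lam`, the helper names, and in particular `impute_once`, which fills missing cells with random draws from the observed values and is only a crude stand-in for a proper multiple imputation procedure. The sketch also assumes roughly centered variables and omits an intercept. The MI-LASSO, which requires a group LASSO, is not shown.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1, no intercept."""
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    norms = [sum(v * v for v in c) / n for c in cols]
    b = [0.0] * p
    resid = list(y)  # residual y - Xb; equals y because b starts at zero
    for _ in range(n_iter):
        for j in range(p):
            if norms[j] == 0.0:
                continue
            c = cols[j]
            # Covariance of column j with the partial residual (column j added back)
            rho = sum(c[i] * (resid[i] + c[i] * b[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / norms[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= c[i] * delta
                b[j] = new_bj
    return b

def impute_once(X, rng):
    """Hypothetical stand-in for one imputation: draw from observed column values."""
    p = len(X[0])
    observed = [[r[j] for r in X if r[j] is not None] for j in range(p)]
    return [[v if v is not None else rng.choice(observed[j])
             for j, v in enumerate(r)] for r in X]

def selected(b, tol=1e-8):
    """Indices of predictors with a nonzero coefficient."""
    return {j for j, v in enumerate(b) if abs(v) > tol}

# Toy data (assumed): 4 predictors, only the first two affect the outcome.
rng = random.Random(0)
n, p, M = 60, 4, 5
X_full = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * r[0] - 1.5 * r[1] + rng.gauss(0, 0.5) for r in X_full]

# Make roughly a quarter of the values on predictor 1 missing.
X_miss = [list(r) for r in X_full]
for r in X_miss:
    if rng.random() < 0.25:
        r[1] = None

imputed = [impute_once(X_miss, rng) for _ in range(M)]
lam = 0.1  # assumed penalty; in practice this is tuned, e.g. by cross-validation

# Separate approach: one LASSO per imputed data set, so selections may differ.
separate_sets = [selected(lasso_cd(Xm, y, lam)) for Xm in imputed]
counts = {j: sum(j in s for s in separate_sets) for j in range(p)}

# Stacked approach: one LASSO on all M imputed data sets stacked row-wise.
stacked_X = [row for Xm in imputed for row in Xm]
stacked_y = y * M
stacked_set = selected(lasso_cd(stacked_X, stacked_y, lam))

print("separate selections:", separate_sets)
print("selection counts across imputations:", counts)
print("stacked selection:", stacked_set)
```

The separate approach leaves a decision about how to combine the per-imputation selections (for example, keeping predictors selected in some minimum share of the imputed data sets), whereas the stacked fit returns a single selection by construction. Because the loss here is averaged over all stacked rows, each original observation effectively receives weight 1/M.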
