TY - JOUR

T1 - Automatically identifying relevant variables for linear regression with the Lasso method: A methodological primer for its application with R and a performance contrast simulation with alternative selection strategies

AU - Scherr, Sebastian

AU - Zhou, Jing

PY - 2020/7/2

Y1 - 2020/7/2

N2 - The abundance of available digital big data has created new challenges in identifying relevant variables for regression models. One statistical problem that has gained relevance in the era of big data is high-dimensional statistical inference, where the number of variables greatly exceeds the number of observations. Typically, prediction errors in linear regression skyrocket as the number of included variables approaches the number of observations, and ordinary least squares (OLS) regression no longer works in a high-dimensional scenario. Regularized estimators offer a feasible solution; among them is the Least Absolute Shrinkage and Selection Operator (Lasso), which we introduce to communication scholars here. We present the statistical background of this technique, which combines estimation and variable selection and thereby helps identify relevant variables for regression models in high-dimensional scenarios. We contrast the Lasso with two alternative strategies for selecting variables for regression models, namely a theory-based “subset selection” of variables and a nonselective “all in” strategy. The simulation shows that the Lasso produces lower and relatively more stable prediction errors than the two alternative variable selection strategies; we therefore recommend its use, especially in the high-dimensional settings typical of big data analysis.

AB - The abundance of available digital big data has created new challenges in identifying relevant variables for regression models. One statistical problem that has gained relevance in the era of big data is high-dimensional statistical inference, where the number of variables greatly exceeds the number of observations. Typically, prediction errors in linear regression skyrocket as the number of included variables approaches the number of observations, and ordinary least squares (OLS) regression no longer works in a high-dimensional scenario. Regularized estimators offer a feasible solution; among them is the Least Absolute Shrinkage and Selection Operator (Lasso), which we introduce to communication scholars here. We present the statistical background of this technique, which combines estimation and variable selection and thereby helps identify relevant variables for regression models in high-dimensional scenarios. We contrast the Lasso with two alternative strategies for selecting variables for regression models, namely a theory-based “subset selection” of variables and a nonselective “all in” strategy. The simulation shows that the Lasso produces lower and relatively more stable prediction errors than the two alternative variable selection strategies; we therefore recommend its use, especially in the high-dimensional settings typical of big data analysis.

UR - http://www.scopus.com/inward/record.url?scp=85074459384&partnerID=8YFLogxK

U2 - 10.1080/19312458.2019.1677882

DO - 10.1080/19312458.2019.1677882

M3 - Article

VL - 14

SP - 204

EP - 211

JO - Communication Methods and Measures

JF - Communication Methods and Measures

SN - 1931-2458

IS - 3

ER -