Abstract
This study addresses problems that data-driven social science faces when there is either too much or too little data. An abundance of data, or a (sudden) lack of it, makes it hard to identify the most important predictors in a sea of noise with a model that is as parsimonious and reproducible as possible. In this article, we introduce the model-X knockoff method, proposed by Candès et al. (2018) to reduce the false identification of significant effects that arises from flexibility-ambiguity issues, to a broader audience, particularly in the social sciences and humanities. Our goal is to provide an accessible starting point and, ideally, to spark interest among researchers in these fields in exploring how model-X knockoffs can enhance their work. Findings from a simulation that contrasts performance across methods indicate that model-X knockoffs select fewer irrelevant variables than other statistical methods for automatic variable selection, resulting in fewer mistakes. The simulation findings also show that model-X knockoffs are stable and less sensitive than other procedures to even small changes in the dataset, making them a viable way to reduce researcher degrees of freedom and increase the reproducibility of scientific findings. An additional real-data example demonstrates the operational utility of the simulation.
Original language | English |
---|---|
Article number | 101380 |
Journal | Social Sciences & Humanities Open |
Volume | 11 |
Early online date | 28 Feb 2025 |
DOIs | |
Publication status | Published - 2025 |
Keywords
- Big data
- High dimensional statistics
- Model-X knockoffs
- Multivariate linear regression
- Simulation study
- True model
- Variable selection