Abstract
Missing data is a significant issue in many real-world datasets, yet there are no robust methods for dealing with it appropriately.
In this paper, we propose a robust approach to dealing with missing data in classification problems: Multiple Imputation Ensembles (MIE). Our method integrates two approaches, multiple imputation and ensemble methods, and compares two types of ensembles: bagging and stacking. We also propose a robust experimental set-up using 20 benchmark datasets from the UCI machine learning repository. For each dataset, we introduce increasing amounts of data Missing Completely at Random (MCAR). First, we use a number of single and multiple imputation methods to recover the missing values, and then ensemble a number of different classifiers built on the imputed data. We assess the quality of the imputation using dissimilarity measures. We also evaluate MIE performance by comparing classification accuracy on the complete and imputed data. Furthermore, we use the accuracy of simple imputation as a benchmark for comparison. We find that our proposed approach, combining multiple imputation with ensemble techniques, outperforms the others, particularly as the amount of missing data increases.
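The pipeline outlined in the abstract (inject MCAR missingness, impute several times, train one classifier per imputed dataset, combine their predictions) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`make_mcar`, `impute_once`, `mie_predict`), the random-draw imputation, the nearest-centroid classifier, and the majority vote are all assumptions chosen to keep the example dependency-free, and they stand in for the imputation methods, classifiers, and bagging/stacking ensembles the paper actually evaluates.

```python
import random
from collections import Counter


def make_mcar(rows, rate, rng):
    """Copy `rows`, setting each value to None independently with probability `rate`."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_once(rows, rng):
    """One stochastic imputation: fill each gap with a random observed value
    from the same column (a simple stand-in for a real imputation model)."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [
        # Fall back to 0.0 only if a column has no observed values at all.
        [v if v is not None else (rng.choice(observed[j]) if observed[j] else 0.0)
         for j, v in enumerate(row)]
        for row in rows
    ]


def fit_centroids(rows, labels):
    """Nearest-centroid 'training': compute the mean vector of each class."""
    counts = Counter(labels)
    sums = {}
    for row, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict_centroid(centroids, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(row, centroids[y])),
    )


def mie_predict(train_rows, train_labels, test_row, m=5, seed=0):
    """Impute the training data `m` times, fit one model per imputed copy,
    and combine the `m` predictions by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(m):
        completed = impute_once(train_rows, rng)
        model = fit_centroids(completed, train_labels)
        votes.append(predict_centroid(model, test_row))
    return Counter(votes).most_common(1)[0][0]


# Toy data: two well-separated Gaussian clusters with 20% of values removed.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
X += [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(30)]
y = [0] * 30 + [1] * 30
X_missing = make_mcar(X, 0.2, rng)
print(mie_predict(X_missing, y, [9.5, 10.2]))
```

Each imputed copy differs only in the values drawn for the gaps, so the vote averages over imputation uncertainty rather than committing to one filled-in dataset. Raising the `rate` argument mimics the increasing levels of missingness described above.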
Original language | English |
---|---|
Article number | 134 |
Journal | SN Computer Science |
Volume | 1 |
Issue number | 3 |
Early online date | 23 Apr 2020 |
Publication status | Published - May 2020 |
Keywords
- Classification algorithms
- Dissimilarity measures
- Ensemble techniques
- Missing data
- Multiple imputation
Profiles
- Beatriz De La Iglesia
- School of Computing Sciences - Professor & Head of School
- Norwich Institute for Healthy Aging - Member
- Norwich Epidemiology Centre - Member
- Data Science and AI - Member
Person: Research Group Member, Research Centre Member, Academic, Teaching & Research
- Wenjia Wang
- School of Computing Sciences - Professor of Artificial Intelligence
- Data Science and AI - Member
Person: Research Group Member, Academic, Teaching & Research