Abstract
Feature selection (FS) is increasingly important in data analysis and machine learning in the big data era. How the data should be used in feature selection has, however, become a serious issue: the conventional practice of using ALL the data in FS may lead to selection bias, and some authors suggest using only PART of the data instead. This paper investigates the reliability and effectiveness of a PART approach, implemented through a cross-validation mechanism in feature selection filters, and compares it with the ALL approach. Reliability is measured by an Inter-system Average Tanimoto Index, and the effectiveness of the selected features is measured by the mean generalisation accuracy of classification. The experiments are carried out on synthetic datasets generated with a fixed number of relevant features, varied numbers of irrelevant features and instances, and different levels of noise, to mimic some possible real-world environments. The results indicate that the PART approach is more effective in reducing the bias when the dataset is small but starts to lose its advantage as the dataset size increases.
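To make the PART idea and the reliability measure concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions that the abstract does not confirm: the filter is a simple absolute-correlation ranking, the Tanimoto index between two feature subsets is taken as the size of their intersection divided by the size of their union, the "average" is the mean over all pairs of subsets, and the synthetic generator (`make_data`) is a hypothetical stand-in for the datasets used in the paper. All function names are illustrative.

```python
import random
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature subsets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def average_tanimoto(subsets):
    """Mean pairwise Tanimoto similarity (assumed form of the average index)."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets to compare")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def filter_select(X, y, m):
    """Illustrative filter: keep the m features most correlated with y."""
    n_features = len(X[0])
    scores = [abs_correlation([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:m])


def part_select(X, y, m, k, rng):
    """PART: run the filter on the training portion of each of k folds."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    subsets = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        subsets.append(
            filter_select([X[i] for i in train], [y[i] for i in train], m)
        )
    return subsets


def make_data(n, n_relevant, n_irrelevant, noise, rng):
    """Hypothetical generator: relevant features are label plus Gaussian noise."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        relevant = [label + rng.gauss(0, noise) for _ in range(n_relevant)]
        irrelevant = [rng.gauss(0, 1) for _ in range(n_irrelevant)]
        X.append(relevant + irrelevant)
        y.append(label)
    return X, y


rng = random.Random(0)
X, y = make_data(n=60, n_relevant=3, n_irrelevant=20, noise=1.0, rng=rng)
all_subset = filter_select(X, y, m=3)                # ALL: one pass over all data
part_subsets = part_select(X, y, m=3, k=5, rng=rng)  # PART: one subset per fold
print("ALL subset:", sorted(all_subset))
print("PART agreement:", average_tanimoto(part_subsets))
```

A value of `average_tanimoto` close to 1 means the folds agree on which features to keep, while a value close to 0 means the selection is unstable. Varying `n`, `n_irrelevant` and `noise` in this sketch mirrors, in a simplified way, the kinds of conditions the abstract describes.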
Original language | English |
---|---|
Pages | 179-184 |
Number of pages | 6 |
Publication status | Published - Dec 2014 |
Event | 34th SGAI International Conference on Artificial Intelligence, Peterhouse College, Cambridge, United Kingdom (9 Dec 2014 → 11 Dec 2014) |
Conference
Conference | 34th SGAI International Conference on Artificial Intelligence |
---|---|
Country/Territory | United Kingdom |
City | Cambridge |
Period | 9/12/14 → 11/12/14 |
Keywords
- Feature Selection
- Cross-Validation
- Filters
- Reliability measure
Profiles
- Wenjia Wang
  - School of Computing Sciences - Professor of Artificial Intelligence
  - Data Science and AI - Member
  - Person: Research Group Member, Academic, Teaching & Research