Abstract
Divide-and-conquer is probably the most commonly used strategy for dealing with a dataset that is too big to be loaded as a whole into a computing system's memory for analysis. It partitions the dataset into many smaller subsets that can be loaded into memory separately to induce models, which can then be combined by machine learning ensemble methods. However, it is not clear how the size of the subsets affects the learning performance of the individual models and of their ensemble. This paper proposes an ensemble-based algorithm to quickly detect the relational pattern between ensemble accuracy and the size of the partitioned data subsets. An ensemble framework for the algorithm is implemented and tested on 12 relatively big benchmark datasets. The experimental results indicate that it identifies the relational patterns accurately and efficiently, in fewer than 10 steps. The identified patterns show that in most cases it is not necessary to use the whole dataset for analysis, because a few smaller subsets are already sufficiently representative of the underlying problem, which is useful knowledge in big data analysis.
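The partition, train, and combine idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm or framework: the synthetic data, the nearest-centroid base learner, the majority-vote combiner, and every function name below are assumptions chosen only to keep the example self-contained.

```python
import random
from collections import Counter


def make_data(n, seed=0):
    """Synthetic two-class data (an assumption): label is 1 when x0 + x1 > 1, with 10% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        y = int(x[0] + x[1] > 1.0)
        if rng.random() < 0.1:
            y = 1 - y  # flip the label to simulate noise
        data.append((x, y))
    return data


def fit_centroids(subset):
    """Base learner: nearest-centroid classifier, a stand-in for whatever learner is actually used."""
    sums, counts = {}, Counter()
    for x, y in subset:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}


def predict(model, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))


def ensemble_predict(models, x):
    """Combine the individual models by majority vote."""
    votes = Counter(predict(m, x) for m in models)
    return votes.most_common(1)[0][0]


def ensemble_accuracy(train, test, subset_size, n_models):
    """Train one model per non-overlapping subset and score the voted ensemble on test."""
    if subset_size * n_models > len(train):
        raise ValueError("not enough training data for the requested partition")
    models = [
        fit_centroids(train[i * subset_size:(i + 1) * subset_size])
        for i in range(n_models)
    ]
    correct = sum(ensemble_predict(models, x) == y for x, y in test)
    return correct / len(test)


train = make_data(2000, seed=1)
test = make_data(500, seed=2)
# Probe how ensemble accuracy changes as the subset size grows.
for size in (10, 50, 200):
    print(size, ensemble_accuracy(train, test, size, n_models=5))
```

Only the first `subset_size * n_models` training examples are ever read, which mirrors the abstract's point that a few small subsets may suffice; how the paper chooses the sizes to probe is not specified here.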
Original language | English |
---|---|
Pages | 48-55 |
Number of pages | 8 |
Publication status | Published - 20 Aug 2015 |
Event | IEEE Conference on Big Data - Helsinki, Finland Duration: 20 Aug 2015 → 22 Aug 2015 |
Conference
Conference | IEEE Conference on Big Data |
---|---|
Country/Territory | Finland |
City | Helsinki |
Period | 20/08/15 → 22/08/15 |
Keywords
- Big Data
- Machine Learning
- Data mining
- Ensemble Methods
Profiles
- Wenjia Wang
  - School of Computing Sciences - Professor of Artificial Intelligence
  - Data Science and AI - Member

  Person: Research Group Member, Academic, Teaching & Research