An Algorithm for Identifying the Learning Patterns in Big Data

Majed Farrash, Wenjia Wang

Research output: Contribution to conference › Paper › peer-review

1 Citation (Scopus)

Abstract

Divide-and-conquer is probably the most commonly used strategy for dealing with a dataset that is too big to be loaded into a computing system's memory as a whole for analysis. It partitions such a dataset into many smaller subsets that can be loaded into memory separately to induce models, which can then be combined by machine learning ensemble methods. However, it is not clear how the size of the subsets affects the learning performance of the individual models and their ensemble. This paper proposes an ensemble-based algorithm to quickly detect the relational pattern between ensemble accuracy and the size of the partitioned data subsets. An ensemble framework for the algorithm is implemented and tested on 12 relatively big benchmark datasets. The experimental results indicate that it is able to identify the relational patterns accurately and efficiently, in fewer than 10 steps. The identified patterns show that in most cases it is not necessary to use the whole dataset for analysis, as a few smaller subsets are already sufficiently representative of the underlying problem, which is useful knowledge in big data analysis.
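
The partition-and-combine strategy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the synthetic data, the nearest-centroid base learner, the majority-vote combiner, and all function names (`make_data`, `train_centroid`, `predict`, `ensemble_accuracy`) are assumptions chosen only to show how ensemble accuracy might be measured against subset size.

```python
import random
from collections import Counter

def make_data(n, rng):
    """Synthetic two-class data (an assumption): 2-D Gaussian blobs at (0,0) and (2,2)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 2.0 * label
        data.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))
    return data

def train_centroid(subset):
    """Hypothetical base learner: store the mean point of each class seen in the subset."""
    sums, counts = {}, Counter()
    for (x, y), label in subset:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] += 1
    return {lab: (sx / counts[lab], sy / counts[lab]) for lab, (sx, sy) in sums.items()}

def predict(model, point):
    """Return the class whose centroid is closest to the point."""
    x, y = point
    return min(model, key=lambda lab: (x - model[lab][0]) ** 2 + (y - model[lab][1]) ** 2)

def ensemble_accuracy(train, test, subset_size, n_models):
    """Train one model per disjoint subset of subset_size rows, then majority-vote."""
    models = [
        train_centroid(train[i * subset_size:(i + 1) * subset_size])
        for i in range(n_models)
    ]
    correct = 0
    for point, label in test:
        votes = Counter(predict(m, point) for m in models)
        correct += votes.most_common(1)[0][0] == label
    return correct / len(test)

rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_data(20000, rng)   # stands in for a dataset too large to load at once
test = make_data(2000, rng)

# Measure ensemble accuracy for a few subset sizes, using 5 models each time.
results = {size: ensemble_accuracy(train, test, size, n_models=5) for size in (10, 100, 1000)}
for size, acc in results.items():
    print(f"subset size {size:>5}: ensemble accuracy {acc:.3f}")
```

Plotting such accuracies against subset size is one simple way to look for the kind of pattern the abstract refers to; the search procedure that finds it in fewer than 10 steps is specific to the paper and is not reproduced here.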
Original language: English
Pages: 48-55
Number of pages: 8
Publication status: Published - 20 Aug 2015
Event: IEEE Conference on Big Data - Helsinki, Finland
Duration: 20 Aug 2015 to 22 Aug 2015

Conference

Conference: IEEE Conference on Big Data
Country/Territory: Finland
City: Helsinki
Period: 20/08/15 to 22/08/15

Keywords

  • Big Data
  • Machine Learning
  • Data mining
  • Ensemble Methods
