Towards Feature Selection for Disk-Based Multirelational Learners: A Case Study with a Boosting Algorithm

S. Hoche, S. Wrobel

Research output: Contribution to conference (Paper)

Abstract

Feature selection is an important issue for any learning algorithm, since reduced feature sets lead to an improvement in learning time, reduced model complexity and, in many cases, a reduced risk of overfitting. When performing feature selection for RAM-based learning algorithms, we typically assume that the cost of accessing each feature is uniform. In multirelational data mining, especially when data are to be held in a relational database management system (RDBMS), this is no longer the case. The dominant cost in such a setting is the scan of a relation, so that the cost of using a feature from a relation that needs to be scanned anyway is comparatively small, whereas the cost of adding a feature from a relation that has not been used before is high. This means that existing work on feature selection using the uniform cost assumption may not be applicable in a disk-based setting. In this paper, we report the results of a case study that extends prior work on multirelational feature selection, in particular in the context of a boosting algorithm. As shown by our study, using the previously developed strategies leads, on average, to larger numbers of relations that need to be considered and loaded into memory, and thus to higher cost in a disk-based setting. Instead, a simple relation-oriented strategy can be used to minimize the cost of accessing additional relations. We describe experimental results to show how this basic strategy interacts with the feature selection variants proposed previously, and show that significant gains are made even in a main-memory setting.
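The non-uniform cost argument in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the `Feature` class, the `gain` scores, the `scan_cost` and `column_cost` constants, and the greedy gain-per-cost rule are all hypothetical choices made only to show why a feature from an already-scanned relation is cheap while a feature from a new relation is expensive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    relation: str   # relation (table) the feature lives in
    gain: float     # hypothetical relevance score, not from the paper


def access_cost(feature, loaded, scan_cost=10.0, column_cost=1.0):
    """Assumed cost model: a full scan is charged only for relations
    that have not been loaded yet; every feature pays a small column cost."""
    if feature.relation in loaded:
        return column_cost
    return scan_cost + column_cost


def select_features(candidates, budget, scan_cost=10.0, column_cost=1.0):
    """Greedily pick the affordable feature with the best gain per unit cost."""
    selected, loaded, spent = [], set(), 0.0
    remaining = list(candidates)
    while remaining:
        affordable = [
            f for f in remaining
            if spent + access_cost(f, loaded, scan_cost, column_cost) <= budget
        ]
        if not affordable:
            break
        best = max(
            affordable,
            key=lambda f: f.gain / access_cost(f, loaded, scan_cost, column_cost),
        )
        spent += access_cost(best, loaded, scan_cost, column_cost)
        loaded.add(best.relation)   # the relation is now scanned anyway
        selected.append(best)
        remaining.remove(best)
    return selected, loaded, spent
```

Under these assumed costs, once a relation has been scanned its remaining features become far more attractive than slightly more relevant features from untouched relations, so the selection tends to exhaust loaded relations before opening new ones.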
Original language: English
Pages: 30-43
Number of pages: 14
Publication status: Published - Aug 2003
Event: 2nd Workshop on Multi-Relational Data Mining - Washington DC, United States
Duration: 24 Aug 2003 to 27 Aug 2003

Workshop

Workshop: 2nd Workshop on Multi-Relational Data Mining
Abbreviated title: MRDM-2003
Country/Territory: United States
City: Washington DC
Period: 24/08/03 to 27/08/03
