TY - JOUR
T1 - Avoiding overfitting in the analysis of high-dimensional data with artificial neural networks (ANNs)
AU - Defernez, Marianne
AU - Kemsley, E. Katherine
PY - 1999
N2 - Complex data analysis is becoming more easily accessible to analytical chemists, including natural computation methods such as artificial neural networks (ANNs). Unfortunately, in many of these methods, inappropriate choices of model parameters can lead to overfitting. This study concerns overfitting issues in the use of ANNs to classify complex, high-dimensional data (where the number of variables far exceeds the number of specimens). We examine whether a parameter ρ, equal to the ratio of the number of observations in the training set to the number of connections in the network, can be used as an indicator to forecast overfitting. Networks possessing different ρ values were trained using as inputs either raw data or scores obtained from principal component analysis (PCA). A primary finding was that different data sets behave very differently. For data sets with either abundant or scant information related to the proposed group structure, overfitting was little influenced by ρ, whereas for intermediate cases some dependence was found, although it was not possible to specify values of ρ which prevented overfitting altogether. The use of a tuning set, to control termination of training and guard against overtraining, did not necessarily prevent overfitting from taking place. However, for data containing scant group-related information, the use of a tuning set reduced the likelihood and magnitude of overfitting, although not eliminating it entirely. For other data sets, little difference in the nature of overfitting arose from the two modes of termination. Small data sets (in terms of number of specimens) were more likely to produce overfit ANNs, as were input layers comprising large numbers of PC scores. Hence, for high-dimensional data, the use of a limited number of PC scores as inputs, a tuning set to prevent overtraining and a test set to detect and guard against overfitting are recommended.
UR - http://www.scopus.com/inward/record.url?scp=0032702796&partnerID=8YFLogxK
DO - 10.1039/A905556H
M3 - Article
C2 - 26114398
AN - SCOPUS:0032702796
VL - 124
SP - 1675
EP - 1681
JO - Analyst
JF - Analyst
SN - 0003-2654
ER -