TY - JOUR

T1 - A likelihood-based approach to defining statistical significance in proteomic analysis where missing data cannot be disregarded

AU - Wood, John

AU - White, Ian R.

AU - Cutler, Paul

PY - 2004

Y1 - 2004

N2 - In several aspects of science it is important to be able to discriminate statistically relevant changes from background noise in complex data sets. Often this must be done where a significant proportion of the data may be missing for technical or operational issues. An example of such a challenge is proteomic data, where the aim is to assess differences in protein expression between groups, over several thousand proteins and across large sample sets.
Such comparisons require firstly the digitisation of two-dimensional gel electrophoresis (2-DE) images followed by image analysis to create a matched table of thousands of quantitated spots, usually as density volumes. Often this is non-ideal as the analysis software fails to accurately detect, match or co-register spots across the data set resulting in incomplete representation of data points. This is important, as the so-called “missing data” cannot be ignored. It is not possible to say whether the spot data are missing as a result of being below the limit of detection, in which case a background value or similar nominal level can be assigned, or because of a failure to match, in which case such an interpolated value would be in error. By virtue of the fact that in proteomic analysis the assay is determining values for thousands of proteins simultaneously, there is the extra complication of “multiplicity”. Multiplicity reflects the fact that by measuring so many elements at once, some will appear to have altered expression purely as a result of chance. Statistical significance tests cannot therefore be considered definitive unless this issue is addressed. We describe the development of a statistical approach to data analysis for direct group comparisons, which deals with both missing data and multiplicity. Although the example given here is proteomic data, this novel approach could be used to compare the means of any two distributions where missing data can neither be disregarded nor set to zero, but where the probability that data will be missing rises as ‘true’ values get smaller.

U2 - 10.1016/j.sigpro.2004.06.019

DO - 10.1016/j.sigpro.2004.06.019

M3 - Article

VL - 84

SP - 1777

EP - 1788

JO - Signal Processing

JF - Signal Processing

SN - 0165-1684

IS - 10

ER -