A likelihood-based approach to defining statistical significance in proteomic analysis where missing data cannot be disregarded

John Wood, Ian R. White, Paul Cutler

Research output: Contribution to journal › Article › peer-review

17 Citations (Scopus)


In several areas of science it is important to be able to discriminate statistically relevant changes from background noise in complex data sets. Often this must be done where a significant proportion of the data may be missing for technical or operational reasons. An example of such a challenge is proteomic data, where the aim is to assess differences in protein expression between groups, over several thousand proteins and across large sample sets. Such comparisons first require the digitisation of two-dimensional gel electrophoresis (2-DE) images, followed by image analysis to create a matched table of thousands of quantitated spots, usually as density volumes. This process is often imperfect: the analysis software fails to accurately detect, match or co-register spots across the data set, leaving some data points unrepresented. This is important, as the so-called “missing data” cannot be ignored. It is not possible to say whether the spot data are missing as a result of being below the limit of detection, in which case a background value or similar nominal level can be assigned, or because of a failure to match, in which case such an interpolated value would be in error. Because a proteomic assay determines values for thousands of proteins simultaneously, there is the extra complication of “multiplicity”: when so many elements are measured at once, some will appear to have altered expression purely by chance. Statistical significance tests cannot therefore be considered definitive unless this issue is addressed. We describe the development of a statistical approach to data analysis for direct group comparisons, which deals with both missing data and multiplicity.
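To make the multiplicity problem concrete, the following minimal Python sketch computes how many spots would be expected to look "significant" by chance alone. It assumes independent tests in which no protein truly differs between groups, and it uses the Bonferroni correction only as a familiar illustration; the abstract does not say which adjustment the authors use. The function names and the figure of 2000 spots are hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of chance 'hits' when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test threshold that keeps the family-wise error rate at most alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    n_spots = 2000  # hypothetical number of matched spots on the gels
    alpha = 0.05    # conventional per-test significance level
    print(expected_false_positives(n_spots, alpha))
    print(prob_at_least_one_false_positive(n_spots, alpha))
    print(bonferroni_threshold(n_spots, alpha))
```

Under these assumptions an unadjusted 5% test would flag a substantial number of spots even if no expression had changed, which is why an unadjusted significance test cannot be treated as definitive.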
Although the example given here is proteomic data, this novel approach could be used to compare the means of any two distributions where missing data can neither be disregarded nor set to zero, but where the probability that data will be missing rises as ‘true’ values get smaller.
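The consequence of missingness that becomes more likely as the true value falls can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' likelihood model: the logistic missingness curve, its `midpoint` and `steepness` parameters, and the normal distribution of log-scale spot volumes are all hypothetical choices made for illustration.

```python
import math
import random


def prob_missing(value: float, midpoint: float = 4.0, steepness: float = 2.0) -> float:
    """Assumed logistic curve: smaller values are more likely to be missing."""
    return 1.0 / (1.0 + math.exp(steepness * (value - midpoint)))


def simulate(n: int = 20000, true_mean: float = 5.0, sd: float = 1.0, seed: int = 0):
    """Return (mean of all values, mean of observed values, fraction missing)."""
    rng = random.Random(seed)
    values = [rng.gauss(true_mean, sd) for _ in range(n)]
    # A value is observed only if it survives the value-dependent dropout.
    observed = [v for v in values if rng.random() >= prob_missing(v)]
    full_mean = sum(values) / n
    observed_mean = sum(observed) / len(observed)
    fraction_missing = 1.0 - len(observed) / n
    return full_mean, observed_mean, fraction_missing


if __name__ == "__main__":
    full_mean, observed_mean, fraction_missing = simulate()
    print(f"mean of all values:      {full_mean:.3f}")
    print(f"mean of observed values: {observed_mean:.3f}")
    print(f"fraction missing:        {fraction_missing:.3f}")
```

Because low values drop out preferentially, the mean of the observed values overstates the mean of the full sample, so simply discarding the missing spots biases a group comparison. Setting them to zero would bias it in the other direction, which is the situation the abstract describes.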
Original language: English
Pages (from-to): 1777-1788
Number of pages: 12
Journal: Signal Processing
Issue number: 10
Publication status: Published - 2004
