A likelihood-based approach to defining statistical significance in proteomic analysis where missing data cannot be disregarded

John Wood, Ian R. White, Paul Cutler

    Research output: Contribution to journal › Article › peer-review

    17 Citations (Scopus)

    Abstract

    In many areas of science it is important to discriminate statistically relevant changes from background noise in complex data sets. Often this must be done where a substantial proportion of the data is missing for technical or operational reasons. An example of such a challenge is proteomic data, where the aim is to assess differences in protein expression between groups, over several thousand proteins and across large sample sets. Such comparisons first require digitisation of two-dimensional gel electrophoresis (2-DE) images, followed by image analysis to create a matched table of thousands of quantitated spots, usually as density volumes. The result is often non-ideal: the analysis software fails to accurately detect, match or co-register spots across the data set, so some data points are not represented. This matters because the so-called “missing data” cannot be ignored. It is not possible to say whether spot data are missing because they are below the limit of detection, in which case a background value or similar nominal level can be assigned, or because of a failure to match, in which case such an interpolated value would be in error. Because a proteomic assay determines values for thousands of proteins simultaneously, there is the extra complication of “multiplicity”: when so many elements are measured at once, some will appear to have altered expression purely by chance. Statistical significance tests cannot therefore be considered definitive unless this issue is addressed. We describe the development of a statistical approach to data analysis for direct group comparisons that deals with both missing data and multiplicity.
    Although the example given here is proteomic data, this novel approach could be used to compare the means of any two distributions where missing data can neither be disregarded nor set to zero, but where the probability that data will be missing rises as ‘true’ values get smaller.
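
    The two problems named in the abstract can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' likelihood method: the logistic missingness curve, its `midpoint` and `slope` parameters, and all function names are hypothetical choices made for illustration. It shows (a) how many false positives to expect by chance when many proteins are tested at once, and (b) why dropping missing values biases a group mean upward when smaller values are more likely to go missing.

```python
import math
import random
import statistics


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of true-null tests rejected purely by chance."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance of one or more false positives across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def prob_missing(value: float, midpoint: float, slope: float) -> float:
    """Hypothetical logistic model: the smaller the value, the likelier it is missing.

    Equals 0.5 at `midpoint` and falls towards 0 as `value` grows.
    """
    return 1.0 / (1.0 + math.exp(slope * (value - midpoint)))


def simulate_observed(true_mean, sd, n, midpoint, slope, rng):
    """Draw n values from a normal distribution, then drop each one with
    probability prob_missing(value); return only the values that survive."""
    values = [rng.gauss(true_mean, sd) for _ in range(n)]
    return [v for v in values if rng.random() >= prob_missing(v, midpoint, slope)]


if __name__ == "__main__":
    # Multiplicity: 2000 spots tested at alpha = 0.05 with no real differences.
    print("expected false positives:", expected_false_positives(2000, 0.05))
    print("P(at least one, 10 tests):", prob_at_least_one_false_positive(10, 0.05))

    # Missingness: the mean of the surviving values overstates the true mean.
    rng = random.Random(0)  # fixed seed so the run is repeatable
    observed = simulate_observed(10.0, 1.0, 5000, midpoint=9.5, slope=2.0, rng=rng)
    print("true mean: 10.0, observed-only mean:", statistics.mean(observed))
    print("fraction missing:", 1 - len(observed) / 5000)
```

    In this toy setting the observed-only mean sits above the true mean, which is the reason missing values "cannot be disregarded"; assigning every missing spot a background value would instead be wrong for spots lost to a matching failure.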
    Original language: English
    Pages (from-to): 1777-1788
    Number of pages: 12
    Journal: Signal Processing
    Volume: 84
    Issue number: 10
    Publication status: Published - 2004
