A time series approach for clustering mass spectrometry data
TAGARELLI, Andrea; VELTRI, P.
2012-01-01
Abstract
Advanced statistical techniques and data mining methods have been recognized as powerful support for mass spectrometry (MS) data analysis. In particular, owing to their unsupervised nature, data clustering methods have attracted increasing interest for exploring, identifying, and discriminating pathological cases from MS clinical samples. Supporting biomarker discovery in protein profiles has drawn special attention from biologists and clinicians. However, the huge amount of information contained in a single sample, that is, the high dimensionality of MS data, makes the effective identification of biomarkers a challenging problem. In this paper, we present a data mining approach for the analysis of MS data in which the mining phase is cast as a clustering task. Under the natural assumption of modeling MS data as time series, we propose a new representation model of MS data that significantly reduces the dimensionality of such data while preserving the relevant features. Besides reducing dimensionality (high dimensionality typically hampers both the effectiveness and the efficiency of computational methods), the proposed representation model also eases the critical task of preprocessing the raw spectra within the overall MS data analysis process. We evaluated our MS data clustering approach on publicly available proteomic datasets, and the experimental results show the effectiveness of the proposed approach, which can be used to aid clinicians in studying pathological states and formulating diagnoses.
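The abstract does not specify the representation model itself, so the following is only an illustrative sketch of the general pipeline it describes: treat each spectrum as a time series, reduce its dimensionality, then cluster. It assumes piecewise aggregate approximation (PAA) as a generic stand-in for the reduction step and a plain k-means with farthest-first initialization for the clustering step; neither is claimed to be the authors' method. All function names (`paa`, `kmeans`, `farthest_first_init`, `make_synthetic_spectra`) and the synthetic two-peak data are hypothetical and chosen for illustration.

```python
import numpy as np


def paa(spectrum, n_segments):
    """Piecewise aggregate approximation: mean intensity of each segment.

    Assumption: a generic time-series reduction, NOT the paper's model.
    """
    spectrum = np.asarray(spectrum, dtype=float)
    # array_split tolerates lengths that are not a multiple of n_segments
    return np.array([seg.mean() for seg in np.array_split(spectrum, n_segments)])


def farthest_first_init(X, k):
    """Deterministic seeding: start at row 0, then add the farthest row."""
    centroids = [X[0]]
    for _ in range(k - 1):
        # squared distance of every row to its nearest chosen centroid
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    return np.array(centroids)


def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means on the rows of X; returns (labels, centroids)."""
    centroids = farthest_first_init(X, k)
    for _ in range(n_iter):
        # squared Euclidean distance from every row to every centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # keep the old centroid if a cluster happens to be empty
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids


def make_synthetic_spectra(n_per_class=10, length=1000, seed=0):
    """Hypothetical data: two classes, each with one noisy Gaussian peak."""
    rng = np.random.default_rng(seed)
    x = np.arange(length)
    peaks = [np.exp(-0.5 * ((x - mu) / 10.0) ** 2) for mu in (250, 700)]
    return np.array([
        peak + 0.05 * rng.standard_normal(length)
        for peak in peaks
        for _ in range(n_per_class)
    ])


spectra = make_synthetic_spectra()                  # 20 spectra x 1000 points
reduced = np.array([paa(s, 50) for s in spectra])   # 20 spectra x 50 features
labels, _ = kmeans(reduced, k=2)
print(reduced.shape, labels)
```

In this toy setting the reduction shrinks each spectrum from 1000 values to 50 while keeping the peak location that separates the two groups. Real MS spectra would need domain-specific choices (segment count, intensity normalization, distance measure) that this sketch does not address.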