Datamining Exploration Software™ (DES™)

DES™ is part of the MBS™ bioinformatics package!

DES™ performs data mining procedures, such as extracting patterns from data. As mass spectrometers gather more data, with the amount of data doubling every three years, data mining is becoming an increasingly important tool for transforming these data into information. DES™ performs the procedures needed to extract valuable information from large MALDI or SELDI datasets.


Module 4: Data Analysis

Sub-Module 1: Data Transformation

Transformation of original data is usually applied (prior to actual analysis) in order to make data meet assumptions of statistical procedures and/or to improve interpretability of graphs.

Available transformations are:

  • Logarithmic
  • Negative Logarithmic
  • Exponential
  • Power Transformation
  • Normalization
  • Robust Normalization
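
The transformations listed above can be sketched in a few lines of Python. This is only an illustration: the source does not define the exact formulas DES™ uses, so the natural logarithm, the square-root default for the power transformation, the z-score form of normalization, and the median/MAD form of robust normalization are all assumptions, as are the function names.

```python
import math
import statistics

def log_transform(values):
    """Natural logarithm (assumed base); needs strictly positive intensities."""
    return [math.log(v) for v in values]

def negative_log_transform(values):
    """Negative natural logarithm of each intensity."""
    return [-math.log(v) for v in values]

def exponential_transform(values):
    """Exponential of each intensity."""
    return [math.exp(v) for v in values]

def power_transform(values, exponent=0.5):
    """Raise each non-negative intensity to a power (0.5 is the square root)."""
    return [v ** exponent for v in values]

def normalize(values):
    """Assumed z-score form: subtract the mean, divide by the sample std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

def robust_normalize(values):
    """Assumed robust form: subtract the median, divide by the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [(v - med) / mad for v in values]

# Made-up intensities for one feature, with a single large outlier.
intensities = [1.0, 2.0, 4.0, 8.0, 1000.0]
z = normalize(intensities)
r = robust_normalize(intensities)
```

In this example the outlier inflates both the mean and the standard deviation, so the z-scores of the four ordinary values are squeezed together, whereas the median and MAD are barely affected by it.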


Sub-Module 2: One-Dimension Data Analysis

In one-dimensional (descriptive) data analysis, popular summary statistics are used to summarize the basic properties of a single (selected) feature (M/Z). The most commonly used measures of central tendency and measures of variability are computed.

Descriptive statistics available:

  • Measures of central tendency: mean, median, trimmed mean, minimum and maximum
  • Measures of variability: standard deviation (std) and variance, sample range, interquartile range (IQR), median absolute deviation (MAD)
  • Empirical quantiles: the first and the third quartile (Q1, Q3)

All summary statistics are computed in two variants:

  • Simple (computed for all spectra)
  • Categorized in groups (computed for "control" and "disease" spectra separately)

Results:

  • Statistics values:
    • simple summary statistics (derived on the basis of all spectra)
    • summary statistics for the disease group (i.e. derived using only "disease" spectra)
    • summary statistics for the control group (i.e. derived using only "control" spectra)
  • Boxplot (multiple box-plot derived for all spectra and for "disease" and "control" groups, separately)
  • Histogram (multiple histogram plot derived for all spectra and for "disease" and "control" groups, separately)
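
As a rough illustration of the simple and categorized variants, the sketch below computes the listed statistics for one feature, first over all spectra and then per group. The function names, the 10% trimming fraction, and the quartile method (the default of Python's `statistics.quantiles`) are assumptions; the source does not say which conventions DES™ follows.

```python
import statistics

def trimmed_mean(values, fraction=0.1):
    """Mean after dropping the given fraction of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return statistics.mean(ordered[k:len(ordered) - k])

def summarize(values):
    """Summary statistics for a single feature (one M/Z) across spectra."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    med = statistics.median(values)
    return {
        "mean": statistics.mean(values),
        "median": med,
        "trimmed_mean": trimmed_mean(values),
        "min": min(values),
        "max": max(values),
        "std": statistics.stdev(values),
        "variance": statistics.variance(values),
        "range": max(values) - min(values),
        "Q1": q1,
        "Q3": q3,
        "IQR": q3 - q1,
        "MAD": statistics.median([abs(v - med) for v in values]),
    }

def summarize_by_group(values, labels):
    """Simple variant under 'all', plus one entry per group label."""
    result = {"all": summarize(values)}
    for group in sorted(set(labels)):
        result[group] = summarize(
            [v for v, g in zip(values, labels) if g == group]
        )
    return result

# Made-up intensities of one feature in eight spectra.
values = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
labels = ["control"] * 4 + ["disease"] * 4
stats = summarize_by_group(values, labels)
```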


Sub-Module 3: Multi-Dimension Data Analysis

The aim of multi-dimensional data visualization is to provide a graphical summary of relationships (similarities) between samples (spectra) or between features (M/Z's) in a given dataset. Results are displayed as easy-to-interpret heatmaps that show the correlations between samples or between features.

Available tools:

  • Heatmaps:
    • correlation between samples (cases)
    • correlation between features (M/Z's)

In the case of correlation between features, in order to facilitate the analysis, a smaller number of best features is selected using AUC (Area Under ROC Curve) statistics.

Results:

  • Heatmap (graphical display of the correlation matrix between samples or between features; for correlation between features, only the 30 best features with respect to the AUC statistic are shown)
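
The matrix behind such a heatmap can be sketched as follows. The example assumes Pearson correlation (the source does not name the correlation coefficient) and omits both the AUC-based feature selection and the plotting step; the function names and data are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def correlation_matrix(vectors):
    """Square matrix of pairwise correlations between the given vectors."""
    return [[pearson(u, v) for v in vectors] for u in vectors]

# Rows are spectra (samples), columns are features (M/Z's).
spectra = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],
    [4.0, 3.0, 2.0, 1.0],
]

# Correlation between samples: compare the rows.
sample_corr = correlation_matrix(spectra)

# Correlation between features: compare the columns.
features = [list(column) for column in zip(*spectra)]
feature_corr = correlation_matrix(features)
```

Each cell of the resulting matrix would be mapped to a colour in the heatmap; the diagonal is always 1 because every vector correlates perfectly with itself.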


Sub-Module 4: Biomarker Detection

Identification of disease-specific biomarkers is one of the major topics of proteomics, providing opportunities to develop and validate new therapeutic or diagnostic strategies.

The biomarker detection analysis available in the software examines the discriminative properties of all features. The ability of each feature to discriminate between the "disease" and "control" groups is assessed with the aid of a selected separation measure.

The following separation measures are available:

  • Divergence
  • Fisher score
  • SAM score
  • T-test (P-values)
  • Kolmogorov-Smirnov statistics (K-S statistics)
  • T-score
  • WMW test (Wilcoxon-Mann-Whitney test)
  • AUC (Area Under ROC curve)

The result is reported as a feature ranking showing the selected number of best features, i.e. features that exhibit the largest discriminative power. The obtained ranking can be further used to identify potential biomarkers.

Results:

  • Ranking Chart (bar plot presenting both names and scores corresponding to the selected 'best features')
  • Criterion Values:
    • selected (best) features
    • selected (best) features scores (i.e. values of the separation measure)
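
A feature ranking of this kind can be sketched with two of the listed measures. The Fisher score is taken here in one common form (squared difference of group means divided by the sum of group variances) and the AUC is computed from pairwise comparisons; whether DES™ uses exactly these definitions is an assumption, and the feature names and intensities are invented.

```python
import statistics

def fisher_score(disease, control):
    """Squared mean difference divided by the sum of group variances."""
    diff = statistics.mean(disease) - statistics.mean(control)
    return diff ** 2 / (
        statistics.variance(disease) + statistics.variance(control)
    )

def auc(disease, control):
    """Fraction of (disease, control) pairs in which the disease value
    is larger; ties count as one half."""
    wins = sum(
        1.0 if d > c else 0.5 if d == c else 0.0
        for d in disease
        for c in control
    )
    return wins / (len(disease) * len(control))

def rank_features(data, labels, score, top=2):
    """Return the `top` (feature, score) pairs with the largest scores.
    `data` maps a feature name to one intensity per spectrum."""
    scores = {}
    for name, values in data.items():
        d = [v for v, g in zip(values, labels) if g == "disease"]
        c = [v for v, g in zip(values, labels) if g == "control"]
        scores[name] = score(d, c)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]

labels = ["control"] * 3 + ["disease"] * 3
data = {
    "mz_1000": [1.0, 1.2, 0.9, 3.0, 3.1, 2.8],  # well separated
    "mz_2000": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no separation
    "mz_3000": [5.0, 4.0, 6.0, 6.5, 5.5, 7.0],  # partial separation
}
ranking = rank_features(data, labels, fisher_score)
auc_best = auc(data["mz_1000"][3:], data["mz_1000"][:3])
```

Note that an AUC of 0.5 means no separation, while values near 0 or 1 both indicate a discriminative feature, so a ranking based on AUC would normally use its distance from 0.5.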


Sub-Module 5: Dimension Reduction

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a widely used feature extraction and visualization technique that transforms original features into a smaller number of important directions called principal components (PCs). The first principal component accounts for as much of data variability as possible, the second component (orthogonal to the first one) accounts for as much of the remaining variability as possible, etc.

PCA results are presented as:

  • 2D scatterplot (scatterplot of the 1st against the 2nd principal component, shown in two variants: simple and categorized. The categorized variant uses the "disease" or "control" group membership of each spectrum)
  • Scree plot (a simple bar plot showing the fraction of the total variance in the data explained by consecutive principal components)

Results:

  • Scree Plot (scree-plot showing the fraction of the total variance in the data explained by each principal component)
  • Cumulative Scree Plot (scree-plot showing cumulative fraction of the total variance in data explained by the first few principal components)
  • Scatterplot (2D PCA scatterplot showing 1st principal component vs. 2nd principal component)
  • Multiple Scatterplot:
    • simple 2D PCA scatterplot ("disease" and "control" groups are not distinguished)
    • categorized 2D PCA scatterplot ("disease" and "control" groups for spectra are distinguished)
  • Loading Scatterplot
  • Loading Bar Plot
  • Loading Line Plot
  • Score-Loading Scatterplots (PCA scatterplot showing cases and PCA loading scatterplot showing features)
  • Variance Explained:
    • names of consecutive principal components
    • explained variance represented by consecutive principal components
  • Loading Values
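
To make the terms "loading", "score", and "variance explained" concrete, the sketch below extracts only the first principal component using power iteration on the covariance matrix. Power iteration is chosen here because it needs nothing beyond the standard library; it is not claimed to be the method DES™ uses, and production code would more likely call an SVD or eigendecomposition routine from a numerical library. All names and data are illustrative.

```python
import math

def centered_covariance(rows):
    """Centre the columns and return (covariance matrix, centred rows)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    return cov, centered

def first_principal_component(rows, iterations=200):
    """Return (loadings, fraction of variance explained, scores) for PC1."""
    cov, centered = centered_covariance(rows)
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p  # arbitrary unit-length starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient gives the variance along the direction found.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    total_variance = sum(cov[i][i] for i in range(p))
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, eigenvalue / total_variance, scores

# Four made-up spectra with three features; the first two features move
# together, so one component should capture almost all of the variance.
rows = [
    [2.0, 1.9, 0.5],
    [4.0, 4.1, 0.4],
    [6.0, 6.2, 0.6],
    [8.0, 7.8, 0.5],
]
loadings, explained, scores = first_principal_component(rows)
```

The `loadings` vector corresponds to one row of the loading values, `scores` to one axis of the score scatterplot, and `explained` to the first bar of the scree plot.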

Partial Least Squares (PLS)

Partial Least Squares (PLS), also known as projection to latent structures, finds linear combinations of the original features (predictors) that are mutually uncorrelated and whose correlation with the response (i.e. disease status) is maximized.

Results:

  • Score Scatterplot
  • Loading Scatterplot
  • Loading Values
  • Loading Line Plot
  • Loading Bar Plot
  • Variable Importance
  • Variable Importance Plot
  • Model Coefficients
  • Model Performance Plot
  • Confusion Matrix
  • Model Performance Statistics
  • ROC Chart
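
As a minimal sketch of the idea, the first component in the common single-response formulation (PLS1) has a weight vector proportional to X'y computed on centred data, which is the direction whose scores have the largest covariance with the response. The code below computes only that first component; the 0/1 coding of disease status, the function name, and the data are assumptions, and the full DES™ procedure (further components, model coefficients, performance statistics) is not reproduced.

```python
import math

def first_pls_component(rows, response):
    """Return (unit weight vector, scores) of the first PLS1 component."""
    n, p = len(rows), len(rows[0])
    x_means = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(response) / n
    x_centered = [[r[j] - x_means[j] for j in range(p)] for r in rows]
    y_centered = [y - y_mean for y in response]
    # Weight of feature j is its cross-product with the centred response.
    weights = [
        sum(x_centered[i][j] * y_centered[i] for i in range(n))
        for j in range(p)
    ]
    norm = math.sqrt(sum(w * w for w in weights))
    weights = [w / norm for w in weights]
    scores = [
        sum(x_centered[i][j] * weights[j] for j in range(p)) for i in range(n)
    ]
    return weights, scores

# Made-up data: two features, four spectra, response 0 = control, 1 = disease.
rows = [[1.0, 5.0], [1.2, 4.0], [3.0, 6.0], [3.1, 5.0]]
response = [0.0, 0.0, 1.0, 1.0]
weights, scores = first_pls_component(rows, response)
```

Here the first feature separates the groups more strongly than the second, so it receives the larger weight.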


Sub-Module 6: Classification

Mass spectrometry-based proteomics is a powerful technology used in studies concerned with the classification of disease states. Protein samples from disease patients and control (i.e. non-disease) patients are analyzed with MS instruments, and the resulting MS patterns are used to build a classifier.

The software performs binary classification of disease states, i.e. it assigns samples (spectra) to either the disease or the control group. Currently available classification algorithms include:

  • k-NN (k-Nearest Neighbours algorithm)
  • LDA (Linear Discriminant Analysis)
  • SVM (Support Vector Machines)
  • Decision Tree (Classification Trees)
  • Partial Least Squares (PLS)

Additionally, a feature selection step (preceding the actual classification) is included in the analysis. A subset of the best features can be selected using a given separation measure: divergence, Fisher score, SAM score, T-test, Kolmogorov-Smirnov (KS) statistics, T-score, Wilcoxon-Mann-Whitney (WMW) test or AUC.

Classifier performance is assessed using a standard random split into learning and test sets and a number of accuracy measures. The following performance measures are used:

  • Confusion matrix: TP, FP, TN, FN
  • Misclassification error, sensitivity, specificity
  • ROC curve

The performance results are presented for learning and test sets separately.

Results:

  • ROC Chart (ROC curves for the predicted scores, derived for the learning and test sets, respectively)
  • Confusion Matrix (TP, FN, FP, TN computed for the learning and test sets, separately)
  • Performance Statistics:
    • names of the performance measures/statistics derived
    • values of performance measures obtained for the learning set
    • values of performance measures obtained for the test set
  • Selected Features (names of the best features that are used to construct classifier)
  • Removed Features (names of the removed features; note that this applies only to the LDA classifier, where all low-variance features have to be removed prior to fitting the classifier)
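
The evaluation scheme described above can be sketched with a small k-NN classifier. Everything specific here is an assumption made for illustration: the synthetic two-feature data, the 70/30 split, the per-class (stratified) shuffling that guarantees both groups appear in the test set, k = 3, and all function names. Feature selection and the ROC curve are left out.

```python
import random

def knn_predict(train_x, train_y, point, k=3):
    """Majority vote among the k nearest learning-set spectra (1 = disease)."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], point)),
    )[:k]
    votes = sum(1 for _, label in nearest if label == 1)
    return 1 if votes * 2 > k else 0

def confusion(true, predicted):
    """Return (TP, FP, TN, FN) with 'disease' (1) as the positive class."""
    pairs = list(zip(true, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def performance(true, predicted):
    tp, fp, tn, fn = confusion(true, predicted)
    return {
        "error": (fp + fn) / len(true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Synthetic spectra with two features: controls near (1, 1), disease near (5, 5).
rng = random.Random(0)
x = [[rng.gauss(1, 0.3), rng.gauss(1, 0.3)] for _ in range(20)]
x += [[rng.gauss(5, 0.3), rng.gauss(5, 0.3)] for _ in range(20)]
y = [0] * 20 + [1] * 20

# Random 70/30 split, shuffled separately within each group.
learn_idx, test_idx = [], []
for label in (0, 1):
    idx = [i for i, value in enumerate(y) if value == label]
    rng.shuffle(idx)
    cut = int(0.7 * len(idx))
    learn_idx += idx[:cut]
    test_idx += idx[cut:]

learn_x, learn_y = [x[i] for i in learn_idx], [y[i] for i in learn_idx]
test_x, test_y = [x[i] for i in test_idx], [y[i] for i in test_idx]

learn_perf = performance(
    learn_y, [knn_predict(learn_x, learn_y, p) for p in learn_x]
)
test_perf = performance(
    test_y, [knn_predict(learn_x, learn_y, p) for p in test_x]
)
```

Reporting `learn_perf` and `test_perf` side by side mirrors the separate learning-set and test-set results; a large gap between the two would suggest overfitting.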


Sub-Module 7: Cluster Analysis

Cluster Analysis (or clustering) is used to create homogeneous groups of objects (clusters), where objects in one cluster are similar to each other and objects in different clusters are quite distinct.

To find clusters in the data, the popular k-means clustering algorithm is used. A whole range of cluster numbers can be investigated in order to find the optimal one.

For each number of clusters (considered in the analysis), evaluation of clustering quality is performed using a selected validation index. Currently available clustering validation indices include:

  • Silhouette index
  • Davies-Bouldin index
  • Dunn's index
  • C index

Results of the clustering analysis are displayed as a simple line plot showing the values of the validation index against the number of clusters.

Results:

  • Validation Index Plot (line plot showing value of the validation index against number of clusters)
  • Validation Index Values (value of the validation index obtained for a given number of clusters)
  • Optimal Cluster Assignment (the optimal clustering partition, i.e. which spectra belong to which cluster, derived using the optimal number of clusters found by a simple heuristic algorithm)
  • Optimal Clusters (simple characteristics of an optimal clustering partition)
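
The search over a range of cluster numbers can be sketched as follows, using k-means and the silhouette index. The deterministic farthest-point initialisation, the rule "pick the k with the largest mean silhouette", the function names, and the toy data are all assumptions; the source only says that the optimal number is found by a simple heuristic.

```python
import math

def farthest_point_init(points, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from all centres chosen so far (an assumed initialisation)."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        )
    return centres

def kmeans(points, k, iterations=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centres = farthest_point_init(points, k)
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    total = 0.0
    for i, p in enumerate(points):
        mean_dist = {}
        for cluster in set(labels):
            others = [
                q for j, q in enumerate(points)
                if labels[j] == cluster and j != i
            ]
            if others:
                mean_dist[cluster] = (
                    sum(math.dist(p, q) for q in others) / len(others)
                )
        a = mean_dist.pop(labels[i], None)
        if a is None:
            continue
        b = min(mean_dist.values())
        total += (b - a) / max(a, b)
    return total / len(points)

# Three well-separated groups of made-up two-feature spectra.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0),
    (0.0, 10.0), (0.0, 11.0), (1.0, 10.0), (1.0, 11.0),
]
index_values = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(index_values, key=index_values.get)
optimal_assignment = kmeans(points, best_k)
```

Plotting `index_values` against k gives the validation index plot, and `optimal_assignment` corresponds to the optimal cluster assignment. For indices where smaller is better, such as Davies-Bouldin, the selection would use the minimum instead.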

MBS™ for Proteomics

MBS™ is a flexible bioinformatics tool created for a range of tasks in mass spectrometry data analysis. It contains many algorithms developed by experts in mass spectrometry proteomics and statistics.

MBS™ contains the software tool PDS™ for Peak Extraction and Peak Detection, and the software tool DES™ for Data-Mining and Pattern Extraction, both used for MALDI Proteomics Data Analysis.

MBS™ contains the software tool PAS™ for Pair Extraction, De Novo Sequencing, Detection of Phosphorylated Peptides, and Detecting Known and Unknown Peptide Modifications, used for TANDEM MS (MS/MS) Data Analysis.