Overview of ProST

ProST (Projection STatistics) is a software solution for visualizing and analyzing high-dimensional data projected into two dimensions.


Projection of data for visualization of its main trends is one of the first steps in data mining, and a number of different methods have been developed for this task. However, a purely visual interpretation of a projection lacks statistical rigour. ProST provides a user-friendly application for several dimensionality reduction methods:

  • Principal Component Analysis (PCA) [1]
  • t-distributed Stochastic Neighbor Embedding (t-SNE) [2]
  • Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [3]
  • Linear discriminant analysis (LDA) [4]
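As an illustration of what these methods produce (our sketch, not ProST's own code), here is a minimal example projecting scikit-learn's bundled iris dataset to two dimensions with the PCA and LDA implementations that ProST references [1, 4]; t-SNE and UMAP are applied analogously via openTSNE [2] and umap-learn [3].

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative only: project the iris dataset (150 samples, 4 features) to 2-D.
X, y = load_iris(return_X_y=True)

# PCA is unsupervised: it ignores the class labels.
pca_2d = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it uses the labels to maximize class separation.
lda_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(pca_2d.shape, lda_2d.shape)  # (150, 2) (150, 2)
```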

We provide statistical analysis of the level of separation between sample groups for each method. Users can select their preferred projection method and then choose between the Mann-Whitney U test and the t-test for calculating p-values for pairwise group separation. Both the projection and the sample separation in the first two dimensions are provided.
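A rough sketch of the pairwise separation testing described above (an illustration under our own assumptions, not ProST's exact procedure): after projecting, each pair of groups is compared along a projected dimension with either a Mann-Whitney U test or a t-test. The group names and values here are synthetic.

```python
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

rng = np.random.default_rng(0)
# Hypothetical first projected dimension for three sample groups.
proj_dim1 = {
    "A": rng.normal(0.0, 1.0, 40),
    "B": rng.normal(2.0, 1.0, 40),
    "C": rng.normal(0.2, 1.0, 40),
}

# Pairwise p-values for group separation along this dimension.
results = {}
for g1, g2 in combinations(proj_dim1, 2):
    results[(g1, g2)] = {
        "mann_whitney": mannwhitneyu(proj_dim1[g1], proj_dim1[g2]).pvalue,
        "t_test": ttest_ind(proj_dim1[g1], proj_dim1[g2]).pvalue,
    }

for pair, pvals in results.items():
    print(pair, pvals)
```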

Additionally, prior to analysis the user can choose between several different normalization and imputation methods, or provide previously normalized data with no missing values.
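The exact normalization and imputation options are listed in the application itself; as a generic illustration of this preprocessing step (our sketch, using scikit-learn's mean imputation and standardization, which may differ from ProST's options):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy matrix with one missing value, encoded as np.nan.
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])

# Impute missing values (here: column mean), then standardize each feature.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_imputed)

print(X_scaled.mean(axis=0))  # ~[0, 0] after standardization
```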

Details of the sample format are provided, along with an example input file, in the Download sample data tab.


References:

[1] https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

[2] https://github.com/pavlin-policar/openTSNE

[3] https://github.com/lmcinnes/umap

[4] https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html


Preparing your data for ProST



ProST input must be a single .csv file with features in columns and samples in rows.


The class label column must be named 'Class'.
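A minimal sketch of producing a conforming input file with pandas; the feature names, label values, and file name here are hypothetical, while the 'Class' header is the requirement stated above.

```python
import pandas as pd

# Samples in rows, features in columns, class labels in a column named 'Class'.
df = pd.DataFrame({
    "Class":    ["healthy", "healthy", "disease", "disease"],
    "feature1": [0.12, 0.15, 0.90, 0.88],
    "feature2": [1.30, 1.25, 0.40, 0.35],
})
df.to_csv("prost_input.csv", index=False)  # comma-delimited, header in row 1
```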


We provide a subset of the MNIST dataset of handwritten digits as a sample dataset below. The dataset contains 2k samples and 784 dimensions (pixels). Analysis of this sample dataset will take ~5-60s, depending on the settings. The class label in this dataset is 'Label'.



Dimensional Reduction Method Parameters


See Parameter Help Page for more details.


Warning: the number of components must be <= min(# of classes - 1, # of features). At least 3 classes and 2 features are required for LDA dimensionality reduction to work.
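The constraint above can be expressed directly in code; a small illustrative helper (not part of ProST):

```python
def max_lda_components(n_classes: int, n_features: int) -> int:
    """Upper bound on the number of LDA components: min(n_classes - 1, n_features)."""
    return min(n_classes - 1, n_features)

# The MNIST sample (10 classes, 784 features) allows up to 9 LDA components,
# while a 2-class dataset allows only 1 and so cannot give a 2-D LDA plot.
print(max_lda_components(10, 784))  # 9
print(max_lda_components(2, 784))   # 1
```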


t-SNE

Early Exaggeration Coefficient

t-SNE optimization generally occurs in two phases, starting with an early exaggeration (EE) phase in which points are attracted to each other much more strongly. The strength of this exaggerated attraction is controlled by the EE coefficient. The most commonly used value is 12, but t-SNE may yield better dimensionality reductions on smaller datasets with a value of 4, as prescribed in the original t-SNE paper.

(Early Exaggeration) Iterations

t-SNE optimization runs for a pre-determined number of iterations in each of its two phases, early exaggeration and 'normal'. The main differences between these two phases are the EE coefficient (1 for normal, >1 for EE) and the number of iterations.

Allowing early exaggeration to run long enough is vital for preserving global structure. It may be necessary to increase the EE iterations to improve the global structure when analyzing very large datasets. Increasing the number of normal iterations may improve local structure preservation as well.

Affinity Measure

The affinity between points (which the optimization in t-SNE tries to preserve through its loss function) is determined using a distance measure; the option is provided to use cosine distance instead of Euclidean distance, which may be more appropriate for very high dimensional datasets.

Perplexity

Perplexity is related to the variance of the Gaussian distribution used to convert distances between points into affinities. Increasing the perplexity causes the affinity calculation step to consider wider-ranging attractive forces. This can improve the global structure reconstruction at the cost of losing some local structure detail.

Changing the perplexity for small datasets can lead to qualitatively different dimensionality reduction plots, but when using the optimal learning rate (as ProST does), there does not appear to be an improvement.

Multiscale Perplexity

For very large datasets (n >> 30000), it has been shown that global structure is better preserved using multiscale perplexity. Checking this box will automatically select a second perplexity value (n/100) and calculate affinity using two Gaussian kernels with different perplexities (30 and n/100).

This setting is not recommended for small datasets with fewer than 30000 datapoints.
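ProST uses openTSNE [2]; as a rough illustration of how the parameters above map onto a t-SNE call, here is a sketch using scikit-learn's TSNE, which exposes analogous options (our stand-in, not ProST's implementation; openTSNE's argument names differ, e.g. it also exposes the EE iteration count separately).

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)

# Parameters discussed above: perplexity, EE coefficient, affinity metric.
embedding = TSNE(
    n_components=2,
    perplexity=30,          # neighbourhood size used for affinity calculation
    early_exaggeration=12,  # EE coefficient; try 4 on small datasets
    metric="cosine",        # or "euclidean" (the default)
    init="random",
    random_state=0,
).fit_transform(X)

print(embedding.shape)  # (150, 2)
```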


UMAP

Nearest Neighbours

This parameter represents the balance between global and local structure preservation. You may increase the number of nearest neighbours to get better global structure preservation, at the risk of reduced local structure quality.

Minimum Distance

This is the minimum distance between points allowed by UMAP; the smaller it is, the closer UMAP will pack similar points together.

Troubleshooting ProST

When troubleshooting, please review this list of common reasons for ProST failing to run. If you are still experiencing difficulties running our tool, please contact ldomic@uottawa.ca for further assistance. Please include your input dataset and a description of the problem that you experienced. We will reproduce the problem and provide you with a solution.


1. My file does not load or does not produce any results

ProST only accepts comma-delimited files as input. Tab-delimited or Excel files will be read but will not produce any results; please convert your input data into .csv format before running ProST. Data must start from row 2, with the column names (including the user-provided class column name) in row 1. The sample class information can be listed in any column, and all numeric-valued columns in the input are used in the analysis.


2. Missing values must be empty strings

ProST recognizes empty strings as missing values. NA values, null values, and whitespaces (single space) will produce an error when attempting to impute data. Please convert your missing value indicators to empty strings and/or strip your input file of whitespaces.
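One way to perform this conversion with pandas (our sketch; the toy data and output file name are hypothetical, and the indicator list covers the cases mentioned above):

```python
import pandas as pd

# Toy frame with the problematic indicators: 'NA' and a whitespace-only cell.
df = pd.DataFrame({
    "Class":    ["A", "B", "A"],
    "feature1": ["1.5", "NA", " "],
})

# Map common NA indicators and whitespace-only cells to proper missing values;
# to_csv then writes them as empty strings, which is what ProST expects.
df = df.replace(["NA", "null"], pd.NA).replace(r"^\s*$", pd.NA, regex=True)
df.to_csv("prost_input.csv", index=False)

print(df["feature1"].isna().sum())  # 2
```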


3. ProST accepts input datasets containing a maximum of 10 classes

Although any number of classes is acceptable and will produce results, visualization of the violin plots and p-value table becomes difficult with more than 10 sample groups.


4. At least 2 samples per class are required for statistical analysis

Sample groups that have only 1 member will show in the projection plot but will not be used in statistical analysis of separation.


5. The Class/label column must be properly labelled as 'Class'

Improperly labelled class columns will prevent the user from filtering by class. Furthermore, an improperly labelled class column that contains all integers will be included in the analysis and will affect the statistical results.

Contact us

ldomic@uottawa.ca


Cite the use of ProST in a publication

Danny Salem, Anuradha Surendra, Graeme SV McDowell, Miroslava Čuperlović-Culf (2024) Projection Statistics – ProST – an online implementation of data dimensionality reduction methods with statistical assessment of group separation. {Link to article}


Public Server

ProST: https://complimet.ca/shiny/dev_site/ProST/


Software License

ProST is free software. You can redistribute it and/or modify it under the terms of the GNU General Public License v3 (or any later version) as published by the Free Software Foundation. As per the GNU General Public License, ProST is distributed as a bioinformatic tool to assist users WITHOUT ANY WARRANTY and without any implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. All limitations of warranty are indicated in the GNU General Public License.