Overview of ProST
ProST (PROjection STatistics) is a software solution for visualizing and analyzing high-dimensional data projected into two dimensions.
Projecting data to visualize its main trends is one of the first steps in data mining, and a number of methods have been developed for this task. However, a purely visual interpretation of a projection lacks statistical rigour. ProST provides a user-friendly application for several dimensionality reduction methods:
- Principal Component Analysis (PCA) [1]
- t-distributed Stochastic Neighbor Embedding (t-SNE) [2]
- Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [3]
- Linear discriminant analysis (LDA) [4]
We provide statistical analysis of the level of separation of sample groups for each method. Users can select the preferred projection method and then choose between the Mann-Whitney U test and the t-test for calculating p-values for pairwise group separation. Both the projection and the sample separation in the first two dimensions are provided.
Additionally, prior to analysis, users can choose between several normalization and imputation methods, or provide previously normalized data with no missing values.
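The workflow described above can be sketched with standard Python libraries. This is an illustrative pipeline under assumed settings (z-score normalization, PCA, Mann-Whitney on the first component), not ProST's actual implementation; the data and class names are made up:

```python
# Sketch of a ProST-style workflow (illustrative, not ProST's actual code):
# normalize, project to 2D, then test pairwise group separation on the
# first component with a Mann-Whitney U test.
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy data: three classes of 30 samples each in 50 dimensions.
X = np.vstack([rng.normal(loc=m, size=(30, 50)) for m in (0.0, 1.0, 2.0)])
y = np.repeat(["A", "B", "C"], 30)

# Normalize, then project to two dimensions (PCA shown; ProST also
# offers t-SNE, UMAP, and LDA).
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Pairwise p-values for separation along the first component.
pvals = {}
for a, b in combinations(np.unique(y), 2):
    stat, p = mannwhitneyu(Z[y == a, 0], Z[y == b, 0])
    pvals[(a, b)] = p
print(pvals)
```

The same loop could use `scipy.stats.ttest_ind` in place of `mannwhitneyu`, mirroring the choice of test offered in the interface.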
Details of the sample format are provided, with an example input file, in the Download sample data tab.
References:
[1] https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
[2] https://github.com/pavlin-policar/openTSNE
[3] https://github.com/lmcinnes/umap
Preparing your data for ProST
ProST input must be a single .CSV file with features in columns and samples in rows.
The class label column must be named 'Class'.
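A minimal input file in the expected layout might look as follows. The sample and feature names here are hypothetical, chosen only to show the shape of the file (samples in rows, features in columns, class labels in a column named 'Class'):

```python
# Hypothetical minimal ProST input: samples in rows, features in columns,
# class labels in a column named 'Class'.
import io

import pandas as pd

csv_text = """Sample,Class,feat_1,feat_2,feat_3
S1,control,0.12,1.30,0.55
S2,control,0.10,1.25,0.60
S3,treated,0.80,0.40,1.10
S4,treated,0.85,0.42,1.05
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df["Class"].value_counts())
```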
We provide a subset of the MNIST dataset of handwritten digits as a sample dataset below. The dataset contains 2k samples and 784 dimensions (pixels). Analysis of this sample dataset will take ~5-60s, depending on the settings. The class label in this dataset is 'Label'.
Sample Dataset
Control Panel
Dimensional Reduction Method Parameters
See Parameter Help Page for more details.
No settings
Warning: Number of components must be <= min(# of classes - 1, # of features). At least 3 classes and 2 features are required for LDA dimensionality reduction to work.
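The constraint above can be checked before fitting. A minimal sketch using scikit-learn's LDA (an assumed stand-in; the source does not state which LDA implementation ProST uses):

```python
# n_components must be <= min(n_classes - 1, n_features);
# a 2D LDA plot therefore needs at least 3 classes and 2 features.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))   # 60 samples, 10 features
y = np.repeat([0, 1, 2], 20)    # 3 classes -> at most 2 components

max_components = min(len(np.unique(y)) - 1, X.shape[1])
assert max_components >= 2, "need >= 3 classes (and >= 2 features) for a 2D LDA plot"

Z = LinearDiscriminantAnalysis(n_components=max_components).fit_transform(X, y)
print(Z.shape)  # (60, 2)
```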
No settings
t-SNE
Early Exaggeration Coefficient
t-SNE optimization generally occurs in two steps, starting with an early exaggeration phase in which points attract each other much more strongly. The strength of this exaggerated attraction is controlled by the EE coefficient. The most commonly used value is 12, but t-SNE may yield better dimensional reduction on smaller datasets with a value of 4, as prescribed in the original t-SNE paper.
(Early Exaggeration) Iterations
t-SNE optimization occurs in two steps, each running for a pre-determined number of iterations: early exaggeration and 'normal' optimization. The main differences between these two steps are the EE coefficient (1 for normal, >1 for EE) and the number of iterations.
Allowing early exaggeration to run long enough is vital for preserving global structure. When analyzing very large datasets, it may be necessary to increase the EE iterations to improve the global structure. Increasing the number of normal iterations may likewise improve local structure preservation.
Affinity Measure
The affinity between points (which the t-SNE optimization tries to preserve through its loss function) is determined using a distance measure; the option is provided to use cosine distance instead of Euclidean distance, which may be more appropriate for very high-dimensional datasets.
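A small sketch of why the two metrics can disagree in high dimensions: cosine distance compares directions only, while Euclidean distance is also sensitive to magnitude (toy vectors, not ProST data):

```python
# Two high-dimensional points with the same direction but different
# magnitudes: far apart in Euclidean distance, nearly identical in
# cosine distance.
import numpy as np
from scipy.spatial.distance import cosine, euclidean

rng = np.random.default_rng(3)
a = rng.normal(size=500)
b = 3.0 * a + rng.normal(scale=0.1, size=500)  # same direction, scaled up

print(euclidean(a, b))  # large: magnitudes differ
print(cosine(a, b))     # near 0: directions agree
```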
Perplexity
Perplexity is related to the variance of the Gaussian distribution used to convert distances between points into affinities. Increasing the perplexity will cause the affinity calculation step to consider wider-ranging attractive forces. This can improve the global structure reconstruction at the cost of losing some local structure detail.
Changing the perplexity on small datasets can lead to qualitatively different dimensional reduction plots, but when using the optimal learning rate (as ProST does), there does not appear to be an improvement.
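Numerically, perplexity follows the standard t-SNE definition: for one point, the Gaussian affinities over its neighbours form a distribution whose perplexity is 2 raised to its Shannon entropy, i.e. an "effective number of neighbours". A minimal sketch with made-up distances:

```python
# Perplexity of one point's affinity distribution, per the standard
# t-SNE definition: perplexity = 2 ** H(P), with H the Shannon entropy.
import numpy as np

rng = np.random.default_rng(4)
d2 = rng.uniform(0.1, 5.0, size=200)  # squared distances to 200 neighbours
sigma = 1.0                           # Gaussian bandwidth for this point

p = np.exp(-d2 / (2 * sigma**2))
p /= p.sum()
entropy = -np.sum(p * np.log2(p))
perplexity = 2 ** entropy
print(perplexity)  # effective number of neighbours considered
```

In practice t-SNE inverts this relationship: the user fixes the perplexity and the algorithm searches for the bandwidth sigma that achieves it for each point.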
Multiscale Perplexity
For very large datasets (n >> 30000), it has been shown that global structure is better preserved using multiscale perplexity. Checking this box will automatically select a second perplexity value (n/100) and calculate affinities using two Gaussian distributions with different variances, corresponding to perplexities 30 and n/100.
This setting is not recommended for small datasets with fewer than 30000 datapoints.
UMAP
Nearest Neighbours
This parameter balances global against local structure preservation. You may increase the number of nearest neighbours to obtain better global structure preservation, at the risk of reduced local structure quality.
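The local/global balance can be pictured through the k-nearest-neighbour graph that UMAP starts from: with a small k the graph stays within clusters, while a large k links clusters together, pulling in more global structure. A sketch using scikit-learn's k-NN graph on toy clusters (illustrative only, not UMAP itself):

```python
# Count k-NN graph edges crossing between two well-separated clusters
# for a small and a large neighbour count.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(6)
# Two well-separated 2D clusters of 20 points each.
X = np.vstack([rng.normal(loc=0.0, size=(20, 2)),
               rng.normal(loc=10.0, size=(20, 2))])

cross_edges = {}
for k in (5, 25):
    G = kneighbors_graph(X, n_neighbors=k).toarray()
    # Edges from one cluster into the other.
    cross_edges[k] = G[:20, 20:].sum() + G[20:, :20].sum()
print(cross_edges)  # larger k -> more cross-cluster edges
```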
Minimum Distance
This is the minimum distance between points allowed by UMAP; the smaller it is, the closer UMAP will pack similar points together.
Troubleshooting ProST
When troubleshooting, please review this list of common reasons for ProST failing to run. If you are still experiencing difficulties running our tool, please contact ldomic@uottawa.ca for further assistance. Please include your input dataset and a description of the problem that you experienced. We will reproduce the problem and provide you with a solution.
1. My file does not load or does not produce any results
ProST only accepts comma-delimited files as input. Tab-delimited or Excel files will be read but will not produce any results. Please convert your input data into .csv format before running ProST. Data must start from row 2, with row 1 containing the column names, including the name of the sample class column provided by the user. Sample information can be listed in any column; all numeric-value columns in the input are used in the analysis.
2. Missing values must be empty strings
ProST recognizes empty strings as missing values. NA values, null values, and whitespace (a single space) will produce an error when attempting to impute data. Please convert your missing value indicators to empty strings and/or strip your input file of whitespace.
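A cleanup along these lines can be done with pandas before uploading. This is a hedged sketch with a made-up file; the regular expression replaces the markers named above (NA, null, lone whitespace) with empty strings:

```python
# Replace NA/null markers and whitespace-only cells with empty strings
# so the file matches ProST's missing-value convention.
import io

import pandas as pd

raw = """Sample,Class,feat_1,feat_2
S1,control,0.1,NA
S2,control, ,0.6
S3,treated,null,1.1
"""
# keep_default_na=False keeps "NA" as a literal string instead of NaN.
df = pd.read_csv(io.StringIO(raw), dtype=str, keep_default_na=False)
df = df.replace(to_replace=r"^\s*(NA|null)?\s*$", value="", regex=True)
csv_out = df.to_csv(index=False)
print(csv_out)
```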
3. ProST accepts input datasets containing a maximum of 10 classes
Although any number of classes is acceptable and will produce results, visualization of the violin plots and p-value table becomes difficult with more than 10 sample groups.
4. At least 2 samples per class are required for statistical analysis
Sample groups that have only 1 member will show in the projection plot but will not be used in statistical analysis of separation.
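The rule above amounts to filtering on class counts; a tiny sketch with a hypothetical class column:

```python
# Classes with a single member appear in the projection but are
# excluded from pairwise statistics.
from collections import Counter

labels = ["A", "A", "B", "B", "B", "C"]  # hypothetical class column
counts = Counter(labels)
testable = [c for c, n in counts.items() if n >= 2]
print(testable)  # ['A', 'B'] -- 'C' has only one sample
```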
5. The Class/label column must be properly labelled as 'Class'
Improperly labelled class columns will prevent the user from filtering by class. Furthermore, an improperly labelled class column containing all integers will be included in the analysis and will affect the statistical results.
Contact us
Cite the use of ProST in a publication
Danny Salem, Anuradha Surendra, Graeme SV McDowell, Miroslava Čuperlović-Culf (2024) Projection Statistics – ProST – an online implementation of data dimensionality reduction methods with statistical assessment of group separation. {Link to article}
Public Server
ProST: https://complimet.ca/shiny/dev_site/ProST/
Software License
ProST is free software. You can redistribute it and/or modify it under the terms of the GNU General Public License v3 (or later versions) as published by the Free Software Foundation. As per the GNU General Public License, ProST is distributed as a bioinformatic tool to assist users WITHOUT ANY WARRANTY and without any implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. All limitations of warranty are indicated in the GNU General Public License.