Overview of META-BOA


META-BOA (METAbolomics data Balancing with Over-sampling Algorithms) is a software tool for handling class imbalance, primarily in metabolomics and lipidomics datasets.

Class imbalance can greatly impede the building of machine learning models, and a number of methods have been devised to equalize the number of samples across classes by over-sampling the minority class. META-BOA creates "synthetic" samples using one of the following algorithms:

  1. SMOTE: Synthetic minority over-sampling technique
    SMOTE generates new synthetic samples at random positions between known minority class samples, rather than replicating existing samples (Chawla, et al., 2002).

  2. BSMOTE: Borderline synthetic minority over-sampling technique
    BSMOTE generates synthetic samples on the borderline between the majority and minority instances (Han, et al., 2005).

  3. ADASYN: Adaptive synthetic
    The distribution of synthetic samples created by ADASYN depends on the local distribution of samples in the minority class. ADASYN creates more samples near minority samples that have a larger number of majority class cases in their vicinity (He, et al., 2008).

  4. ROSE: Random over-sampling examples
    ROSE is a bootstrap-based approach that creates synthetic samples in the neighbourhood of minority class features (Lunardon, et al., 2014).
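
The interpolation idea behind SMOTE can be illustrated with a short Python sketch. This is not META-BOA's own code: the function name `smote_sketch`, the neighbour count `k` and the `seed` argument are assumptions made for illustration, and the sketch picks starting samples at random rather than following the exact sampling scheme of Chawla, et al. (2002).

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length feature lists from one class (at least two).
    k and seed are illustrative parameters, not META-BOA settings.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to it.
        others = [s for j, s in enumerate(minority) if j != i]
        others.sort(key=lambda s: math.dist(base, s))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the range already covered by that class.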


Users can additionally choose to:

  1. Normalize data using log transformation and/or Z-score normalization or min-max scaling
  2. Visualize projections (PCA, t-SNE) of the original and balanced data separately
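
For readers unfamiliar with the normalization options, the three transforms can be sketched per feature column as below. The function names are hypothetical, and details such as the logarithm base, the handling of zeros and the use of the sample standard deviation are assumptions of this sketch, not documented META-BOA behaviour.

```python
import math
import statistics


def log_transform(column):
    """Natural logarithm of each value (assumption: no offset is added)."""
    if any(v <= 0 for v in column):
        raise ValueError("a plain log transform needs strictly positive values")
    return [math.log(v) for v in column]


def z_score(column):
    """Centre on the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]


def min_max(column):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]
```

Both `z_score` and `min_max` divide by zero on a constant column, so such features would need to be removed or handled separately before calling them.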


Finally, to show the effect of the selected over-sampling method on classification, META-BOA trains a simple random forest classification model separately on the imbalanced and the balanced data. The results of these classification runs are reported as ROC curves and mean 10-fold cross-validation (CV) accuracy. Comparing the results before and after over-sampling indicates how the over-sampling method performs and how it affects the main variances and general characteristics of the data.


References


  1. Chawla, N.V., et al. SMOTE: Synthetic Minority over-Sampling Technique. J. Artif. Int. Res. 2002;16(1):321–357.
  2. He, H., et al. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). 2008. p. 1322-1328.
  3. Han, H., Wang, W.-Y. and Mao, B.-H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In: Huang, D.-S., Zhang, X.-P. and Huang, G.-B., editors, Advances in Intelligent Computing. Berlin, Heidelberg: Springer Berlin Heidelberg; 2005. p. 878-887.
  4. Lunardon, N., Menardi, G. and Torelli, N. ROSE: a Package for Binary Imbalanced Learning. The R Journal 2014;6:79-89.


META-BOA workflow.

Preparing your data for META-BOA


META-BOA input must be a single .CSV file with features (metabolites or lipids) in columns and samples in rows.


Column 1 defines the class label for each sample, while columns 2 onwards contain the feature values. The file can include two or more sample classes (up to 10), with 10 or more samples in each class. Row 1 should provide the feature labels (e.g. metabolite names).
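
The layout rules above can be checked before uploading with a small script. The following is a minimal sketch, assuming the file content has already been read into a string; `check_layout` and its parameters are hypothetical names and are not part of META-BOA.

```python
import csv
import io
from collections import Counter


def check_layout(csv_text, min_per_class=10, max_classes=10):
    """Return a list of layout problems found in the table (empty if none)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        return ["file needs a header row and at least one sample row"]
    problems = []
    header, samples = rows[0], rows[1:]
    if len(header) < 2:
        problems.append("expected a class column followed by feature columns")
    # Every sample row should have as many cells as the header row.
    for n, row in enumerate(samples, start=2):
        if len(row) != len(header):
            problems.append(f"row {n} has {len(row)} cells, header has {len(header)}")
    # Column 1 holds the class label of each sample.
    counts = Counter(row[0] for row in samples if row)
    if not 2 <= len(counts) <= max_classes:
        problems.append(f"found {len(counts)} classes, expected 2 to {max_classes}")
    for label, count in counts.items():
        if count < min_per_class:
            problems.append(
                f"class {label!r} has {count} samples, expected at least {min_per_class}"
            )
    return problems
```

An empty list means the table matches the layout described above; otherwise each entry names one issue to fix before uploading.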


The over-sampling methods included in META-BOA cannot work with data that have missing values. It is therefore recommended that users impute any missing data before using META-BOA, with the method most appropriate for their dataset. If any missing values remain in the input, META-BOA will automatically impute them prior to over-sampling using the KNN method with Euclidean distances and 8 sample neighbours.
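
To make the KNN imputation step concrete, here is a rough Python sketch of the general technique. It is not META-BOA's implementation: the function name `knn_impute`, the use of `None` for missing cells and the way distances are computed when two samples share only some features are all assumptions of this sketch.

```python
import math


def knn_impute(rows, k=8):
    """Fill None cells with the mean of that feature over the k nearest rows."""
    n_features = len(rows[0])

    def distance(a, b):
        # Euclidean distance over features observed in both rows, rescaled
        # to compensate for skipped features (an assumption of this sketch).
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        total = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(total * n_features / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for f, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where this feature is observed.
            donors = [r for j, r in enumerate(rows) if j != i and r[f] is not None]
            donors.sort(key=lambda r: distance(row, r))
            nearest = donors[:k]
            if nearest:
                filled[i][f] = sum(r[f] for r in nearest) / len(nearest)
    return filled
```

Distances are always measured on the original rows, so a value imputed earlier in the loop never influences a later imputation.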


Sample Data


The provided example datasets include a binary set with only two sample classes (1) and a dataset with multiple classes (three groups in this example) (2).

  1. exampleBinarySet.csv
  2. exampleMultiGroup.csv



When troubleshooting, please review this list of common reasons for META-BOA failing to run. If you are still experiencing difficulties running our tool, please contact ldomic@uottawa.ca for further assistance. Please include your input dataset and a description of the problem that you experienced. We will reproduce the problem and provide you with a solution.


1. My file does not load or does not produce any results
META-BOA only accepts comma-delimited files as input. Tab-delimited or Excel files will be read but will not produce any results. Please convert your input data into .csv format before running META-BOA. Your input file can only have sample information in the first column and feature labels in the first row. Data must start from row 2 and column 2, or META-BOA will not produce any results.


2. Missing values must be empty strings
META-BOA recognizes empty strings as missing values. NA values, null values, and whitespaces (single space) will produce an error when attempting to impute data. Please convert your missing value indicators to empty strings and/or strip your input file of whitespaces.
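
One way to prepare a file along these lines is to rewrite it so that common missing-value markers become empty strings. The sketch below works on CSV text held in a string; `blank_out_missing` and the `MISSING_TOKENS` set are hypothetical and should be adapted to the markers actually used in your data.

```python
import csv
import io

# Hypothetical list of markers to treat as missing; extend as needed.
MISSING_TOKENS = {"na", "nan", "null", "n/a"}


def blank_out_missing(csv_text):
    """Return CSV text in which missing-value markers are empty strings."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row_number, row in enumerate(reader):
        cleaned = []
        for col_number, cell in enumerate(row):
            # Trimming also turns whitespace-only cells into empty strings.
            stripped = cell.strip()
            # Only data cells (from row 2, column 2 onwards) are blanked out.
            is_data = row_number > 0 and col_number > 0
            if is_data and stripped.lower() in MISSING_TOKENS:
                stripped = ""
            cleaned.append(stripped)
        writer.writerow(cleaned)
    return out.getvalue()
```

The header row and the class column are only trimmed, so a class that happens to be labelled "NA" is not erased by mistake.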


3. META-BOA accepts input datasets containing a maximum of 10 classes
META-BOA will balance datasets, via over-sampling, containing a maximum of 10 different classes. This includes control classes. Running META-BOA with a dataset containing 11 or more classes will produce an error.


4. My file does not produce any result only when selecting Log transformation in Step 2
Log transformation requires that there are no negative values in the dataset. Please make sure that there are no negative values prior to uploading.


Contact Us

ldomic@uottawa.ca


Cite your use of META-BOA in a publication

Hashimoto-Roth E., Surendra A., Lavallée-Adam M., Bennett S.A.L., Cuperlovic-Culf M. Metabolomics Data Balancing with Over-sampling Algorithms (META-BOA): an online tool for addressing class imbalance in omics datasets.


Public Server

META-BOA: https://complimet.ca/shiny/meta-boa/


Software License

META-BOA is free software. You can redistribute it and/or modify it under the terms of the GNU General Public License v3 (or later versions) as published by the Free Software Foundation. As per the GNU General Public License, META-BOA is distributed as a bioinformatic tool to assist users WITHOUT ANY WARRANTY and without any implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. All limitations of warranty are indicated in the GNU General Public License.


Download your data

Random forest classification, receiver operating characteristic (ROC) curves, principal component analysis (PCA) and t-SNE show the effect of over-sampling on data visualization and classification. For comparison, plots are shown for both the original and the over-sampled data. Plots are also included in the zipped downloadable file set.

Random Forest Classification

ROC before over-sampling


ROC after over-sampling


Data Visualization

PCA before over-sampling


PCA after over-sampling


t-SNE before over-sampling


t-SNE after over-sampling