Simulation-based sample size estimation for classification

For classification (such as stratifying samples by phenotype), simulation-based analysis is used to estimate predictive performance under different training-set sizes. Supported experiment types include Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS). The function fits an intensity-based linear model to the input preliminary data and uses the estimated variance components to simulate new training data with different numbers of proteins and sample sizes. A random forest model is fitted to each simulated training set and used to predict the input preliminary data. This procedure is repeated several times, and the mean predictive accuracy and its variance are reported for each training-set size.
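The procedure described above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the MSstatsSampleSize implementation: the preliminary data are made up, the per-protein group means and pooled residual standard deviations stand in for the fitted linear model's variance components, the randomForest package is assumed to be installed, and the helper names simulate_train and accuracy_for_size are hypothetical. Only the sample size is varied here; the number of proteins is held fixed.

```r
# Illustrative sketch only; not the MSstatsSampleSize implementation.
library(randomForest)  # assumption: the randomForest package is installed

set.seed(1)

# Hypothetical preliminary data: 20 proteins x 12 samples, two groups.
n_prot <- 20
groups <- factor(rep(c("control", "disease"), each = 6))
prelim <- matrix(rnorm(n_prot * length(groups), mean = 20, sd = 1),
                 nrow = n_prot,
                 dimnames = list(paste0("P", seq_len(n_prot)), NULL))
# Shift the first five proteins in the disease group so there is signal.
prelim[1:5, groups == "disease"] <- prelim[1:5, groups == "disease"] + 1.5

# Step 1: per-protein group means and pooled residual SD
# (a simple stand-in for the linear model's variance components).
grp_mean <- sapply(levels(groups), function(g)
  rowMeans(prelim[, groups == g, drop = FALSE]))
resid <- prelim - grp_mean[, as.character(groups)]
prot_sd <- sqrt(rowSums(resid^2) / (length(groups) - nlevels(groups)))

# Step 2: simulate a training set with n_per_group samples per group.
simulate_train <- function(n_per_group) {
  y <- factor(rep(levels(groups), each = n_per_group), levels = levels(groups))
  mu <- grp_mean[, as.character(y), drop = FALSE]   # proteins x samples
  # sd is recycled down each column, so protein i always gets prot_sd[i].
  x <- mu + matrix(rnorm(length(mu), sd = prot_sd), nrow = n_prot)
  dimnames(x) <- list(rownames(prelim), NULL)
  list(x = t(x), y = y)                             # samples in rows
}

# Step 3: train on simulated data, predict the preliminary samples, repeat.
accuracy_for_size <- function(n_per_group, n_rep = 10) {
  acc <- replicate(n_rep, {
    train <- simulate_train(n_per_group)
    fit <- randomForest(x = train$x, y = train$y, ntree = 100)
    mean(predict(fit, newdata = t(prelim)) == groups)
  })
  c(mean = mean(acc), var = var(acc))
}

# Step 4: report mean accuracy and its variance for each training-set size.
sizes <- c(5, 10, 20)
result <- t(sapply(sizes, accuracy_for_size))
rownames(result) <- paste0("n=", sizes)
print(result)
```

Each row of the printed matrix corresponds to one simulated training-set size. Comparing rows gives a rough sense of how much additional samples improve accuracy on the preliminary data.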


From Bioconductor: MSstatsSampleSize

MSstatsSampleSize 1.0.0 (Bioconductor version: Release 3.10, R version >= 3.6)

Type the following in the R console window:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MSstatsSampleSize")

From GitHub: MSstatsSampleSize

MSstatsSampleSize Bioconductor development version: link
The development version of the package MSstatsSampleSize is the most recent and is available here. The version of the main package is updated twice a year to synchronize with the Bioconductor release.


  • Ting Huang, Northeastern University
  • Meena Choi, Northeastern University

Citing MSstatsSampleSize

  • R package: Huang T, Choi M, Vitek O (2019). MSstatsSampleSize: Simulation tool for optimal design of high-dimensional MS-based proteomics experiment. R package version 1.0.0. DOI: 10.18129/B9.bioc.MSstatsSampleSize
  • manuscripts in preparation