Automated Statistical and Machine Learning Platform for Biological Research
Luke Rimmo Lego, Samantha Gauthier, Denver Jn. Baptiste
TL;DR
Biological research often suffers from fragmentation between statistical testing and predictive modeling tools. The authors introduce a browser-based platform that unifies exploratory data analysis, classical hypothesis tests, and Random Forest classification with automated preprocessing and adaptive hyperparameter optimization. The system features a modular architecture, automatic data handling, multiple evaluation metrics, and built-in statistical tests to maintain interpretability and statistical rigor. The integration aims to accelerate reproducible biological discovery and to provide a scalable, extensible framework for future methodological expansion. The platform is released under the MIT license.
Abstract
Research increasingly relies on computational methods to analyze experimental data and predict molecular properties. Current approaches often require researchers to switch between separate tools for statistical analysis and machine learning, creating workflow inefficiencies. We present an integrated platform that combines classical statistical methods with Random Forest classification for comprehensive data analysis in the biological sciences. The platform implements automated hyperparameter optimization, feature importance analysis, and a suite of statistical tests including t-tests, ANOVA, and Pearson correlation analysis. Our methodology addresses the gap among traditional statistical software, modern machine learning frameworks, and biology by providing a unified interface accessible to researchers without extensive programming experience. The system achieves this through automatic data preprocessing, categorical encoding, and adaptive model configuration based on dataset characteristics. Initial testing protocols are designed to evaluate classification accuracy across diverse chemical datasets with varying feature distributions. This work demonstrates that integrating statistical rigor with machine learning interpretability can accelerate biological discovery workflows while maintaining methodological soundness. The platform's modular architecture enables future extensions to additional machine learning algorithms and statistical procedures relevant to bioinformatics.
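
To make the statistical suite named in the abstract concrete, the following is a minimal sketch of the three test statistics (a two-sample t statistic, the one-way ANOVA F statistic, and the Pearson correlation coefficient) using only the Python standard library. It is an illustration under stated assumptions, not the platform's actual code: the function names and the sample measurements are hypothetical, the t statistic uses Welch's unequal-variance form (the abstract does not say which variant the platform applies), and p-values are omitted because they require the t and F distributions.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances assumed)."""
    # Standard error of the difference in means, using sample variances (n - 1).
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def anova_f(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    ss_x = sum((xi - mx) ** 2 for xi in x)
    ss_y = sum((yi - my) ** 2 for yi in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical measurements for three experimental conditions (illustrative only).
control = [1.0, 2.0, 3.0]
treated = [2.0, 3.0, 4.0]
high_dose = [3.0, 4.0, 5.0]

t_stat = welch_t(control, treated)
f_stat = anova_f(control, treated, high_dose)
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(f"t = {t_stat:.4f}, F = {f_stat:.4f}, r = {r:.4f}")
```

In practice, a library such as SciPy provides these tests together with p-values (`scipy.stats.ttest_ind`, `scipy.stats.f_oneway`, `scipy.stats.pearsonr`). Whether the platform relies on such a library is not stated in the abstract.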
