UBio-MolFM: A Universal Molecular Foundation Model for Bio-Systems
Lin Huang, Arthur Jiang, XiaoLi Liu, Zion Wang, Jason Zhao, Chu Wang, HaoCheng Lu, ChengXiang Huang, JiaJun Cheng, YiYue Du, Jia Zhang
TL;DR
UBio-MolFM addresses the scale-accuracy gap in biomolecular simulation by combining a biologically informed dataset (UBio-Mol26), a hardware-efficient equivariant transformer (E2Former-V2), and a three-stage curriculum that enforces energy–force consistency. The model shows ab initio-like fidelity on systems of up to about 1,500 atoms, with competitive microscopic accuracy and substantially higher MD throughput on large biomolecules, across benchmark systems that include water, salts, peptides, and RNA coordinated with Mg$^{2+}$. However, imbalances in biological data coverage introduce domain-specific trade-offs (e.g., in ΔE accuracy for nucleic acids), which point future work toward balanced top-down sampling and broader validation. The work offers a practical path toward high-fidelity, scalable molecular simulation, and the authors plan open releases of data, code, and weights to support community adoption and further development in executable biology.
Abstract
All-atom molecular simulation serves as a quintessential "computational microscope" for understanding the machinery of life, yet it remains fundamentally limited by the trade-off between quantum-mechanical (QM) accuracy and biological scale. We present UBio-MolFM, a universal foundation model framework engineered to bridge this gap. UBio-MolFM introduces three synergistic innovations: (1) UBio-Mol26, a large bio-specific dataset constructed via a multi-fidelity "Two-Pronged Strategy" that combines systematic bottom-up enumeration with top-down sampling of native protein environments (up to 1,200 atoms); (2) E2Former-V2, a linear-scaling equivariant transformer that integrates Equivariant Axis-Aligned Sparsification (EAAS) and Long-Short Range (LSR) modeling to capture non-local physics with up to ~4x higher inference throughput in our large-system benchmarks; and (3) a Three-Stage Curriculum Learning protocol that transitions from energy initialization to energy-force consistency, with force-focused supervision to mitigate energy offsets. Rigorous benchmarking across microscopic forces and macroscopic observables (including liquid water structure, ionic solvation, and peptide folding) demonstrates that UBio-MolFM achieves ab initio-level fidelity on large, out-of-distribution biomolecular systems (up to ~1,500 atoms) and reproduces realistic MD observables. By reconciling scalability with quantum precision, UBio-MolFM provides a robust, ready-to-use tool for the next generation of computational biology.
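Energy-force consistency, as the term is used above, can be read as the requirement that predicted forces equal the negative gradient of the predicted energy, F = -dE/dx. The sketch below illustrates only that condition. It uses a toy harmonic pair potential as a stand-in for a learned model; the function names, the potential, and the test geometry are all assumptions made for illustration and are not taken from UBio-MolFM or E2Former-V2.

```python
import numpy as np

def pair_energy(positions, stiffness=1.0, r0=1.0):
    """Toy harmonic pair energy, a hypothetical stand-in for a learned E(x)."""
    diff = positions[:, None, :] - positions[None, :, :]  # diff[i, j] = x_i - x_j
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.triu_indices(len(positions), k=1)  # count each pair once
    return 0.5 * stiffness * np.sum((dist[i, j] - r0) ** 2)

def analytic_forces(positions, stiffness=1.0, r0=1.0):
    """Forces F = -dE/dx derived by hand for the toy energy above."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder to avoid 0/0; self-terms are zeroed below
    coeff = stiffness * (dist - r0) / dist
    np.fill_diagonal(coeff, 0.0)
    return -np.sum(coeff[:, :, None] * diff, axis=1)

def finite_difference_forces(positions, step=1e-5):
    """Central-difference estimate of -dE/dx, one coordinate at a time."""
    forces = np.zeros_like(positions)
    for atom in range(positions.shape[0]):
        for axis in range(positions.shape[1]):
            plus = positions.copy()
            minus = positions.copy()
            plus[atom, axis] += step
            minus[atom, axis] -= step
            forces[atom, axis] = -(pair_energy(plus) - pair_energy(minus)) / (2 * step)
    return forces

# Arbitrary four-atom geometry chosen for illustration only.
positions = np.array([
    [0.0, 0.0, 0.0],
    [1.2, 0.0, 0.0],
    [0.0, 0.9, 0.0],
    [0.3, 0.4, 1.1],
])

# A consistent energy/force pair gives a gap near numerical noise.
max_gap = np.max(np.abs(analytic_forces(positions) - finite_difference_forces(positions)))
print(f"max |F_analytic - F_numeric| = {max_gap:.2e}")
```

A model whose forces are obtained by differentiating its own energy satisfies this condition by construction, while a model with a separate force head has to be trained toward it. Which of these UBio-MolFM uses is not stated in this summary.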
