UBio-MolFM: A Universal Molecular Foundation Model for Bio-Systems

Lin Huang, Arthur Jiang, XiaoLi Liu, Zion Wang, Jason Zhao, Chu Wang, HaoCheng Lu, ChengXiang Huang, JiaJun Cheng, YiYue Du, Jia Zhang

TL;DR

UBio-MolFM addresses the scale-accuracy gap in biomolecular simulation by combining a biologically informed dataset (UBio-Mol26), a hardware-efficient equivariant transformer (E2Former-V2), and a three-stage curriculum that enforces energy–force consistency. The model demonstrates ab initio-like fidelity for large systems of up to around $1500$ atoms and delivers competitive microscopic accuracy while significantly improving MD throughput on large systems; the evaluated systems include water, salts, peptides, and RNA coordination with Mg$^{2+}$. However, imbalances in biological data coverage introduce domain-specific trade-offs (e.g., nucleic acid ΔE performance), which point future work toward balanced top-down sampling and broader validation. The work provides a practical path toward high-fidelity, scalable molecular simulations, and the authors plan open releases of data, code, and weights to catalyze community adoption and further development in executable biology.

Abstract

All-atom molecular simulation serves as a quintessential "computational microscope" for understanding the machinery of life, yet it remains fundamentally limited by the trade-off between quantum-mechanical (QM) accuracy and biological scale. We present UBio-MolFM, a universal foundation model framework specifically engineered to bridge this gap. UBio-MolFM introduces three synergistic innovations: (1) UBio-Mol26, a large bio-specific dataset constructed via a multi-fidelity "Two-Pronged Strategy" that combines systematic bottom-up enumeration with top-down sampling of native protein environments (up to 1,200 atoms); (2) E2Former-V2, a linear-scaling equivariant transformer that integrates Equivariant Axis-Aligned Sparsification (EAAS) and Long-Short Range (LSR) modeling to capture non-local physics with up to ~4x higher inference throughput in our large-system benchmarks; and (3) a Three-Stage Curriculum Learning protocol that transitions from energy initialization to energy-force consistency, with force-focused supervision to mitigate energy offsets. Rigorous benchmarking across microscopic forces and macroscopic observables (including liquid water structure, ionic solvation, and peptide folding) demonstrates that UBio-MolFM achieves ab initio-level fidelity on large, out-of-distribution biomolecular systems (up to ~1,500 atoms) and reproduces realistic MD observables. By reconciling scalability with quantum precision, UBio-MolFM provides a robust, ready-to-use tool for the next generation of computational biology.
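
To make the notion of energy-force consistency concrete, the following is a minimal sketch, not the authors' training code. It assumes a toy harmonic pair potential (`toy_energy`) as a stand-in for a learned model, derives forces as the negative gradient of the energy by central finite differences, and combines energy and force errors in a weighted loss. All function names and the weights `w_energy` and `w_force` are hypothetical and chosen only for illustration.

```python
import math


def toy_energy(positions, k=1.0, r0=1.0):
    """Hypothetical stand-in for a learned potential: harmonic term on every atom pair."""
    energy = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            energy += 0.5 * k * (r - r0) ** 2
    return energy


def forces_from_energy(energy_fn, positions, h=1e-5):
    """Conservative forces F = -dE/dx, estimated by central finite differences."""
    forces = []
    for i in range(len(positions)):
        f_atom = []
        for d in range(3):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][d] += h
            minus[i][d] -= h
            f_atom.append(-(energy_fn(plus) - energy_fn(minus)) / (2.0 * h))
        forces.append(f_atom)
    return forces


def energy_force_loss(e_pred, e_ref, f_pred, f_ref, w_energy=1.0, w_force=10.0):
    """Weighted sum of squared per-atom energy error and mean squared force error.

    The weights are illustrative assumptions, not values from the paper.
    """
    n = len(f_pred)
    e_term = ((e_pred - e_ref) / n) ** 2
    f_term = sum(
        (a - b) ** 2 for fp, fr in zip(f_pred, f_ref) for a, b in zip(fp, fr)
    ) / (3 * n)
    return w_energy * e_term + w_force * f_term


if __name__ == "__main__":
    atoms = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
    print("energy:", toy_energy(atoms))
    print("forces:", forces_from_energy(toy_energy, atoms))
```

Because forces depend only on the gradient of the energy, adding a constant to the energy leaves them unchanged. This is one reason a force-weighted objective is insensitive to constant energy offsets, in the spirit of the force-focused supervision described above.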

Paper Structure (64 sections, 10 equations, 15 figures, 4 tables, 1 algorithm)

This paper contains 64 sections, 10 equations, 15 figures, 4 tables, 1 algorithm.

Figures (15)

  • Figure 1: The UBio-MolFM Framework. Our approach bridges the scale-accuracy gap through three synergistic pillars: (1) Data: The UBio-Mol26 dataset, constructed via a Two-Pronged Strategy where a bottom-up branch systematically enumerates biochemical building blocks and a top-down branch samples native environments from large protein assemblies; (2) Model: The E2Former-V2 architecture, which achieves linear memory scaling and up to $\sim$4$\times$ higher inference throughput on large systems in our benchmark via Equivariant Axis-Aligned Sparsification (EAAS) and Long--Short Range (LSR) modeling; and (3) Training: A Three-Stage Curriculum Learning protocol progressing from energy initialization to energy--force consistency and multi-fidelity fine-tuning. Key Results: The bottom panels illustrate the application to Cyclosporine A (CsA), where H-bond distances of key residues (water on the left, vacuum on the right) indicate stable maintenance of solvent-dependent open and closed conformations.
  • Figure 2: Trajectory Analysis of Potential Energy Differences. Comparison of predicted vs. DFT energy changes along the longest trajectories from each benchmark category. Absolute values of energy changes ($|\Delta E|$) are plotted on a logarithmic scale. UBio-MolFM (S3) exhibits superior alignment with ground truth fluctuations, while the S2 base consistently demonstrates the robust inductive bias of the E2Former-V2 architecture in tracking macromolecular dynamics.
  • Figure 3: Structural Fidelity of Liquid Water. Comparison of oxygen-oxygen radial distribution functions (O--O RDF) derived from (a) UBio-MolFM (S3) and (b) UMA-S-1p1 $NVT$ trajectories against experimental references [skinner2014structure, chen2016ab]. The two models are closely matched; UMA is slightly closer in the first-peak position, while UBio-MolFM shows slightly sharper peak definition.
  • Figure 4: Hydration Structure of 0.15 mol/L NaCl Solution. Radial distribution functions (RDF) for Na--O, Cl--O, and Na--Cl pairs from a 200 ps $NVT$ simulation. The corresponding coordination numbers (CN) for the first hydration shells of Na$^+$ and Cl$^-$ are marked.
  • Figure 5: Environmental Dependence of Cyclosporine A Conformations. (a) In the aqueous trajectory, intramolecular H-bond distances for key residues remain large, indicating that the initial open state is stably maintained. (b) This stability is driven by consistent hydration, as shown by the high occupancy of hydrogen bonds between the peptide and solvent oxygens. (c) In contrast, the vacuum trajectory shows stable maintenance of the closed state, with internal H-bond distances remaining below 2.5 Å. UBio-MolFM captures these solvent-driven conformational preferences.
  • ...and 10 more figures
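
The coordination numbers marked in Figure 4 are conventionally obtained by integrating the radial distribution function over the first hydration shell, $\mathrm{CN} = 4\pi\rho \int_0^{r_\mathrm{cut}} g(r)\, r^2\, dr$. The sketch below illustrates that standard formula only; the paper's own analysis procedure is not described in this summary. The function name, the binned-histogram input format, and the choice of `r_cut` (typically the first minimum of $g(r)$) are assumptions made for illustration.

```python
import math


def coordination_number(r_edges, g_values, number_density, r_cut):
    """Integrate CN = 4*pi*rho * int_0^r_cut g(r) r^2 dr with the midpoint rule.

    r_edges: bin edges of the RDF histogram (length = len(g_values) + 1)
    g_values: g(r) sampled per bin
    number_density: density of the neighbor species (particles per unit volume)
    r_cut: upper integration limit; bins extending past it are excluded
    """
    total = 0.0
    for lo, hi, g in zip(r_edges[:-1], r_edges[1:], g_values):
        if hi > r_cut:
            break
        r_mid = 0.5 * (lo + hi)
        total += g * r_mid ** 2 * (hi - lo)
    return 4.0 * math.pi * number_density * total


if __name__ == "__main__":
    # Sanity check with a structureless fluid, g(r) = 1, where the result
    # should approach rho * (4/3) * pi * r_cut**3.
    edges = [i / 100 for i in range(301)]
    g_flat = [1.0] * 300
    print(coordination_number(edges, g_flat, number_density=0.0334, r_cut=3.0))
```

For a real trajectory, `g_values` would come from a histogram of pair distances under periodic boundary conditions, and `number_density` would be the density of the neighbor species (for example, water oxygens around Na$^+$).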