Bio2Token: All-atom tokenization of any biomolecular structure with Mamba
Andrew Liu, Axel Elaldi, Nathan Russell, Olivia Viessmann
TL;DR
Bio2Token introduces an all-atom biomolecule tokenizer: a Finite Scalar Quantization (FSQ) auto-encoder, built on the Mamba state-space model, that encodes 3D atomic coordinates into a 4096-token vocabulary. Using a simple Mamba architecture rather than an SE(3)-invariant one, the approach achieves sub-angstrom reconstruction across proteins, RNA, and small molecules, scales to systems of almost 100,000 atoms, and is intended as a basis for all-atom generative modeling. The framework reports competitive accuracy (e.g., RMSEs near $0.5$–$0.6\AA$ for macromolecules and $\sim0.2\AA$ for small molecules) and better computational efficiency than IPA-based decoders, supporting a scalable path toward cross-domain, all-atom structure tokenization. The work provides open-source code and data, underscoring the potential of Mamba-based tokenizers for biomolecular design and for integration with language-model workflows.
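To make the quantization step concrete, the following is a minimal sketch of finite scalar quantization producing a 4096-entry vocabulary. It is not the authors' implementation: the per-dimension levels `(4, 4, 4, 4, 4, 4)`, the sigmoid bounding, and the function names are all assumptions chosen for illustration, since only the vocabulary size is stated here. The encoder that produces the latents is omitted, and a trainable version would also need a straight-through gradient for the rounding step.

```python
import numpy as np

# Hypothetical per-dimension level counts; their product is 4096.
# The source states only the vocabulary size, not this factorization.
LEVELS = np.array([4, 4, 4, 4, 4, 4])


def fsq_quantize(z, levels=LEVELS):
    """Round each latent dimension to one of levels[i] integer codes.

    z: array of shape (n_atoms, len(levels)) with unbounded floats.
    Returns integer codes with codes[:, i] in {0, ..., levels[i] - 1}.
    """
    bounded = 1.0 / (1.0 + np.exp(-z))          # squash into (0, 1)
    return np.rint(bounded * (levels - 1)).astype(np.int64)


def _basis(levels):
    # Mixed-radix place values: 1, L0, L0*L1, ...
    return np.concatenate(([1], np.cumprod(levels[:-1])))


def codes_to_tokens(codes, levels=LEVELS):
    """Collapse per-dimension codes into one token id per atom."""
    return codes @ _basis(levels)


def tokens_to_codes(tokens, levels=LEVELS):
    """Invert codes_to_tokens."""
    return (np.asarray(tokens)[:, None] // _basis(levels)) % levels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(5, len(LEVELS)))  # stand-in for encoder output
    tokens = codes_to_tokens(fsq_quantize(latents))
    print(tokens)  # five token ids, each in [0, 4095]
```

Because the codebook is an implicit grid rather than a learned embedding table, the vocabulary size is fixed entirely by the choice of levels.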
Abstract
Efficient encoding and representation of large 3D molecular structures with high fidelity is critical for biomolecular design applications. Despite this, many representation learning approaches restrict themselves to modeling smaller systems or use coarse-grained approximations of the systems, for example modeling proteins at the resolution of amino acid residues rather than at the level of individual atoms. To address this, we develop quantized auto-encoders that learn atom-level tokenizations of complete protein, RNA, and small-molecule structures with reconstruction accuracies well below 1 angstrom. We demonstrate that a simple Mamba state space model architecture is efficient compared to an SE(3)-invariant IPA architecture, reaches competitive accuracies, and can scale to systems with almost 100,000 atoms. The learned structure tokens of bio2token may serve as the input for all-atom generative models in the future.
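As a point of reference for the reconstruction accuracies quoted above, here is a minimal sketch of an all-atom RMSE between a reference structure and its reconstruction, in the same units as the coordinates (angstroms here). The function name is hypothetical, and the sketch assumes the two point clouds are already in the same frame; whether and how structures are superimposed before the error is computed is not stated in this section.

```python
import numpy as np


def all_atom_rmse(coords_true, coords_pred):
    """Root-mean-square error over atoms.

    Both inputs have shape (n_atoms, 3) and are assumed to share a
    coordinate frame (no alignment is performed in this sketch).
    """
    diff = np.asarray(coords_true) - np.asarray(coords_pred)
    per_atom_sq_dist = np.sum(diff ** 2, axis=1)   # squared distance per atom
    return float(np.sqrt(np.mean(per_atom_sq_dist)))


if __name__ == "__main__":
    reference = np.zeros((4, 3))
    shifted = reference + np.array([0.5, 0.0, 0.0])  # every atom off by 0.5
    print(all_atom_rmse(reference, shifted))
```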
