Bio2Token: All-atom tokenization of any biomolecular structure with Mamba

Andrew Liu, Axel Elaldi, Nathan Russell, Olivia Viessmann

TL;DR

Bio2Token introduces an all-atom biomolecule tokenizer: a Finite-Scalar Quantization (FSQ) auto-encoder built on the Mamba state-space model that encodes 3D atomic coordinates into a 4096-token vocabulary. By avoiding SE(3)-invariant architectures, the approach achieves sub-angstrom reconstruction across proteins, RNA, and small molecules, scales to tens of thousands of atoms, and lays the groundwork for all-atom generative modeling. The framework demonstrates competitive accuracy (e.g., RMSEs near $0.5$–$0.6$ Å for macromolecules and $\sim 0.2$ Å for small molecules) and better computational efficiency than IPA-based decoders, supporting a scalable path toward cross-domain, all-atom structure tokenization. The work provides open-source code and data, underscoring the potential of Mamba-based tokenizers to advance biomolecular design and integration with language-model workflows.
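
To make the 4096-token vocabulary concrete, here is a minimal sketch of how a finite-scalar quantizer can turn a latent vector into a single token id. It is a simplified illustration, not the authors' implementation: the choice of six latent dimensions with four levels each (4^6 = 4096), the function names, and the plain round-to-nearest rule are all assumptions, and the gradient handling needed for training is omitted.

```python
import math

# Assumption: six latent dimensions with four levels each, so 4**6 = 4096 codes.
# The level configuration used in the paper may differ.
LEVELS = (4, 4, 4, 4, 4, 4)


def codebook_size(levels=LEVELS):
    """Number of distinct tokens implied by the per-dimension level counts."""
    return math.prod(levels)


def quantize(z, levels=LEVELS):
    """Map an unbounded latent vector to one integer digit per dimension."""
    digits = []
    for value, n_levels in zip(z, levels):
        bounded = math.tanh(value)                        # squash into [-1, 1]
        scaled = (bounded + 1.0) / 2.0 * (n_levels - 1)   # stretch to [0, n_levels - 1]
        digits.append(int(math.floor(scaled + 0.5)))      # round to the nearest level
    return digits


def digits_to_token(digits, levels=LEVELS):
    """Combine per-dimension digits into one token id (mixed-radix encoding)."""
    token, base = 0, 1
    for digit, n_levels in zip(digits, levels):
        token += digit * base
        base *= n_levels
    return token


def token_to_digits(token, levels=LEVELS):
    """Invert digits_to_token."""
    digits = []
    for n_levels in levels:
        digits.append(token % n_levels)
        token //= n_levels
    return digits


def dequantize(digits, levels=LEVELS):
    """Map digits back to quantized latent values in [-1, 1]."""
    return [2.0 * d / (n - 1) - 1.0 for d, n in zip(digits, levels)]
```

In this sketch the codebook is implicit: no embedding table is learned, and every token id corresponds to one point on a fixed grid in the bounded latent space. A decoder would consume the dequantized values (or the token ids) to reconstruct coordinates.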

Abstract

Efficient encoding and representation of large 3D molecular structures with high fidelity is critical for biomolecular design applications. Despite this, many representation learning approaches restrict themselves to modeling smaller systems or use coarse-grained approximations of the systems, for example modeling proteins at the resolution of amino acid residues rather than at the level of individual atoms. To address this, we develop quantized auto-encoders that learn atom-level tokenizations of complete proteins, RNA and small molecule structures with reconstruction accuracies well below 1 Angstrom. We demonstrate that a simple Mamba state space model architecture is efficient compared to an SE(3)-invariant IPA architecture, reaches competitive accuracies and can scale to systems with almost 100,000 atoms. The learned structure tokens of bio2token may serve as the input for all-atom generative models in the future.

Paper Structure

This paper contains 40 sections, 6 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: [A] Biomolecular system of interest with many thousands of atoms, with a magnified section with annotations for specific points in the point cloud. [B] Illustration of our tokenizer model, transforming point clouds into tokens and then back to point clouds. [C] Implementation details of the bidirectional Mamba layer. The first branch processes the original input using a Mamba block, while the second branch handles the flipped version of the input, reversing the output back to its original orientation afterward. The final step involves adding the results of both branches together. Notably, the two Mamba blocks in the branches share the same weights.
  • Figure 2: 3D renderings of ground-truth molecules (green) and reconstructions from decoded coordinates (blue). Ground-truth molecules are made transparent in the ball-and-stick panels so that the reconstructed models are easier to see. Visuals prepared with Mol* (molstar). (A) Example from the $\nabla^2$DFT scaffold-split test set with its mol2token reconstruction. (B) RNA-protein complex, PDB = 3WBM, reconstructed by bio2token. (C) Multi-chain RNA complex, PDB = 7PTL, reconstructed by bio2token. (D) Neighborhood of a residue on a loop of 3WBM found near the center of coordinate space. (E) Close-up of the RNA helix of 3WBM. (F) Example of errors found near the edge of coordinate space.
  • Figure 3: Reconstruction results on all test data. Numeric values are provided in Appendix tables \ref{tab:tokenizer-performance}–\ref{tab:tokenizer-outofdomain}. For small molecules, only the domain-specific tokenizer mol2token and the combined bio2token achieve a reasonable accuracy of $0.25$–$0.35$ Å. For proteins (CATH4.2 test, CASP14/15), protein2token and bio2token achieve the best results. For the RNA3DB test set, rna2token and bio2token have comparable results, with reconstructions around $0.6$ Å. Macromolecules cannot be reconstructed from the small-molecule mol2token vocabulary.
  • Figure 4: Token circularity with rotations. Panels A and B visualise a $\pi$ rotation of the protein around the z- and x-axis. The zoom into the GLN amino acid shows how the individual atoms change orientation with respect to the centre. The token ids of each atom on the highlighted GLN are plotted in panels C and D as a function of rotation angle. The green and red dotted lines correspond to the tokens at the positions in A and B.
  • Figure 5: Average mixing radius of per-atom position information with increasing number of Mamba blocks in the encoder.
  • ...and 4 more figures
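
The bidirectional layer described in the Figure 1 [C] caption (one branch on the original sequence, one on the flipped sequence with its output flipped back, both branches sharing weights, results added) can be sketched as follows. This is an illustration under stated assumptions only: `causal_scan` is a hypothetical stand-in, a fixed scalar linear recurrence, and is not a Mamba block; the parameter names `decay` and `gain` are invented for the example, and each sequence position carries a single scalar rather than a feature vector.

```python
def causal_scan(x, decay=0.5, gain=1.0):
    """Hypothetical stand-in for a Mamba block: h_t = decay * h_{t-1} + gain * x_t.

    Like any causal sequence model, the output at position t depends only on
    inputs at positions <= t.
    """
    h, out = 0.0, []
    for value in x:
        h = decay * h + gain * value
        out.append(h)
    return out


def bidirectional_layer(x, decay=0.5, gain=1.0):
    """Combine a forward pass and a flipped pass that share the same parameters."""
    forward = causal_scan(x, decay, gain)                # branch 1: original order
    backward = causal_scan(x[::-1], decay, gain)[::-1]   # branch 2: flip, scan, flip back
    return [f + b for f, b in zip(forward, backward)]    # add the two branches


# A single impulse at the first position: the forward branch spreads it to later
# positions, while the flipped branch only sees it at the impulse position itself.
print(bidirectional_layer([1.0, 0.0, 0.0]))  # [2.0, 0.5, 0.25]
```

Because both branches use the same parameters, reversing the input simply reverses the output of the layer, so every position receives context from both directions along the sequence even though each individual scan is causal.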