Adaptive Protein Tokenization

Rohit Dilip, Ayush Varshney, David Van Valen

TL;DR

Adaptive Protein Tokenization (APT) introduces a global, coarse-to-fine tokenization scheme for proteins, built on a diffusion autoencoder with a discrete bottleneck, that yields fixed-size, globally informative representations. Successive tokens add increasing global detail and can be sampled autoregressively, with inference strategies based on token entropy and classifier annealing to balance fidelity and designability. Across reconstruction, generation, and representation tasks, APT matches or surpasses locality-based tokenizers, achieves high designability, and supports zero-shot protein shrinking and affinity maturation. Training uses the loss $\mathcal{L} = \mathcal{L}_{\text{flow}} + \lambda_{\text{size}} \mathcal{L}_{\text{size}}$ and a two-stage pipeline on ~473k AlphaFold2-derived structures. The results highlight the potential of global tokenizers to scale protein modeling to large complexes while enabling task-aware control.

Abstract

Tokenization is a promising path to multi-modal models capable of jointly understanding protein sequences, structure, and function. Existing protein structure tokenizers create tokens by pooling information from local neighborhoods, an approach that limits their performance on generative and representation tasks. In this work, we present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. This change resolves several issues with generative models based on local protein tokenization: it mitigates error accumulation, provides embeddings without sequence-reduction operations, and allows task-specific adaptation of a tokenized sequence's information content. We validate our method on reconstruction, generative, and representation tasks and demonstrate that it matches or outperforms existing models based on local protein structure tokenizers. We show how adaptive tokens enable inference criteria based on information content, which boosts designability. We validate representations generated from our tokenizer on CATH classification tasks and demonstrate that non-linear probing on our tokenized sequences outperforms equivalent probing on representations from other tokenizers. Finally, we demonstrate how our method supports zero-shot protein shrinking and affinity maturation.
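
The entropy-based inference criterion mentioned in the TL;DR can be illustrated with a minimal sketch. It assumes that the autoregressive model exposes, for each sampled token, the categorical distribution it was drawn from. The function names, the threshold `max_entropy`, and the stopping rule (cut the sequence at the first step whose entropy exceeds the threshold) are hypothetical choices made for illustration, not the procedure stated in the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def truncate_by_entropy(tokens, step_probs, max_entropy):
    """Keep the tokens sampled before the first high-entropy step.

    tokens:      sampled token ids in coarse-to-fine order.
    step_probs:  step_probs[i] is the distribution tokens[i] was drawn from.
    max_entropy: hypothetical cutoff in nats (an assumption of this sketch).
    """
    if len(tokens) != len(step_probs):
        raise ValueError("tokens and step_probs must have the same length")
    for i, probs in enumerate(step_probs):
        if token_entropy(probs) > max_entropy:
            # The model is uncertain from here on, so drop the tail.
            return list(tokens[:i])
    return list(tokens)


# Toy example over a 4-entry codebook: the third step is uniform
# (entropy ln 4, about 1.39 nats), so only the first two tokens are kept.
sampled = [5, 2, 7]
distributions = [
    [0.97, 0.01, 0.01, 0.01],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
kept = truncate_by_entropy(sampled, distributions, max_entropy=1.0)
```

Because tokens are ordered coarse-to-fine, truncating the tail removes only the finest detail, which is what makes a prefix-based cutoff meaningful in this setting.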

Paper Structure

This paper contains 45 sections, 9 equations, 14 figures, 5 tables, 2 algorithms.

Figures (14)

  • Figure 1: Prior work uses a single token to represent a local neighborhood around each residue. In our approach, every token provides additional global information, allowing large proteins to be compressed using fewer tokens.
  • Figure 2: Overview of our approach. Model training: Raw input coordinates pass through a transformer encoder to create a 1D sequence of conditioning latents. These are discretized and then condition a diffusion decoder. The noised coordinates are used in a flow loss objective, and the protein size is regressed from the first few latents. Compression: To compress a protein, we encode it and drop tokens from the tail down to a desired reconstruction quality. Representation learning: Our approach directly provides fixed-size, global representations of proteins for downstream tasks, in contrast with most tokenizers, which require a mean-pooling operation. Structure generation: During structure generation, we regress the protein size from the first 1-4 tokens, then use the size and conditioning tokens to decode the atomic coordinates. We optionally perform classifier annealing and drop out the tail based on entropy heuristics, providing a scaled method to balance representation fidelity and faithfulness to natural sequences. Applications: Decoupling protein size from conditioning leads to several immediate applications, such as protein miniaturization and affinity maturation.
  • Figure 3: Reconstructions at varying conditioning strengths, colored by secondary structure content. The number of conditioning tokens increases from left to right and by hue. As the number of tokens increases, more disordered secondary structure emerges. For particularly simple details, 16-32 tokens often capture much of the required resolution. Additional visualizations are provided in the appendix.
  • Figure 4: Impact of the number of tokens on reconstruction quality. We compare against DPLM-2 and ESM-3, the two other tokenized models that demonstrate generative capabilities. Across rFID (left), RMSD (center), and TM-score (right), all metrics improve with additional tokens. The x-axis is the maximum number of tokens used; e.g., for 32 tokens, APT uses at most 20 tokens to encode a protein of length 20.
  • Figure 5: MLP (left) and linear probing (right) results on the CATH-T classification task. We plot top-1 (top) and top-5 (bottom) accuracy. An MLP probe on our tokens outperforms probing on DPLM-2 and ESM-3 representations, whereas a linear probe performs weakly. Our approach provides a convenient way to acquire fixed-size representations without squashing information through a mean-pool.
  • ...and 9 more figures
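
The tail-dropping compression in the Figure 2 caption and the token cap in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions: `reconstruction_error` is a hypothetical callable standing in for "decode this token prefix and measure the error against the original structure" (for example, an RMSD), and the linear search over prefix lengths is an illustrative choice, not the authors' implementation.

```python
def token_budget(protein_length, max_tokens):
    """Number of tokens actually used: capped by the protein length.

    Mirrors the Figure 4 caption: with max_tokens=32, a protein of
    length 20 is encoded with at most 20 tokens.
    """
    return min(protein_length, max_tokens)


def compress(tokens, protein_length, max_tokens):
    """Keep the coarse-to-fine prefix allowed by the budget."""
    return list(tokens[:token_budget(protein_length, max_tokens)])


def shortest_prefix(tokens, reconstruction_error, target_error):
    """Smallest prefix whose reconstruction error meets the target.

    reconstruction_error: hypothetical callable mapping a token prefix to a
    scalar error; it is an assumption of this sketch.
    Returns the full sequence if no prefix reaches the target.
    """
    for k in range(1, len(tokens) + 1):
        prefix = list(tokens[:k])
        if reconstruction_error(prefix) <= target_error:
            return prefix
    return list(tokens)


# Toy stand-in for a decoder: error shrinks as more tokens are kept.
def toy_error(prefix):
    return 8.0 / len(prefix)


all_tokens = list(range(64))
budgeted = compress(all_tokens, protein_length=20, max_tokens=32)
minimal = shortest_prefix(all_tokens, toy_error, target_error=2.0)
```

If the error decreases monotonically with prefix length, the linear scan could be replaced by a binary search, at the cost of assuming that monotonicity holds.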