Adaptive Protein Tokenization
Rohit Dilip, Ayush Varshney, David Van Valen
TL;DR
Adaptive Protein Tokenization (APT) is a global, coarse-to-fine tokenization scheme for protein structures, built on a diffusion autoencoder with a discrete bottleneck that yields fixed-size, globally informative representations. Successive tokens add increasing global detail and can be sampled autoregressively; inference strategies based on token entropy and classifier annealing balance fidelity against designability. Across reconstruction, generation, and representation tasks, APT matches or surpasses locality-based tokenizers, achieves high designability, and supports zero-shot protein shrinking and affinity maturation. The model is trained with the loss $\mathcal{L} = \mathcal{L}_{\text{flow}} + \lambda_{\text{size}} \mathcal{L}_{\text{size}}$ in a two-stage pipeline on ~473k AlphaFold2-derived structures. The results suggest that global tokenizers can scale protein modeling to large complexes while enabling task-aware control.
Abstract
Tokenization is a promising path to multi-modal models capable of jointly understanding protein sequences, structure, and function. Existing protein structure tokenizers create tokens by pooling information from local neighborhoods, an approach that limits their performance on generative and representation tasks. In this work, we present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. This change resolves several issues with generative models based on local protein tokenization: it mitigates error accumulation, provides embeddings without sequence-reduction operations, and allows task-specific adaptation of a tokenized sequence's information content. We validate our method on reconstruction, generative, and representation tasks and demonstrate that it matches or outperforms existing models based on local protein structure tokenizers. We show how adaptive tokens enable inference criteria based on information content, which boosts designability. We validate representations generated from our tokenizer on CATH classification tasks and demonstrate that non-linear probing on our tokenized sequences outperforms equivalent probing on representations from other tokenizers. Finally, we demonstrate how our method supports zero-shot protein shrinking and affinity maturation.
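
To make the idea of an inference criterion based on information content concrete, the following is a minimal sketch, not the method used in this work. It assumes that each autoregressive step exposes a categorical distribution over the token vocabulary, and that sampling stops at the first step whose predictive entropy exceeds a threshold. The function names `token_entropy` and `truncate_by_entropy`, the stopping rule, and the threshold value are all hypothetical and chosen only for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution.

    `probs` is a sequence of non-negative weights; it is normalized here.
    """
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def truncate_by_entropy(step_probs, threshold):
    """Return how many leading tokens to keep.

    Assumed rule (illustrative only): keep tokens until the first step
    whose predictive entropy exceeds `threshold`, then stop.
    """
    for i, probs in enumerate(step_probs):
        if token_entropy(probs) > threshold:
            return i
    return len(step_probs)


# Toy example: three steps over a 4-token vocabulary, from confident to uniform.
step_probs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
kept = truncate_by_entropy(step_probs, threshold=1.0)
```

In this toy example the first two steps fall below the threshold and the uniform third step exceeds it, so two tokens are kept. Because tokens are ordered coarse to fine, truncating the sequence in this way would still leave a global, if less detailed, representation.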
