SaDiT: Efficient Protein Backbone Design via Latent Structural Tokenization and Diffusion Transformers

Shentong Mo, Lanqing Li

TL;DR

SaDiT addresses the inefficiency of diffusion-based protein backbone design by performing generation in a discrete latent space defined by SaProt tokens. It combines a Diffusion Transformer backbone with latent structural tokens to model long-range dependencies while enforcing $SE(3)$-equivariance. A novel IPA Token Cache reduces redundant geometric computations, achieving near-linear scaling in sampling and enabling design of longer proteins (e.g., up to 800 residues) with high designability. Across unconditional and fold-class-conditioned tasks, SaDiT outperforms state-of-the-art methods in speed and structural viability, illustrating the value of tokenization for protein structure generation.

Abstract

Generative models for de novo protein backbone design have achieved remarkable success in creating novel protein structures. However, diffusion-based approaches remain computationally intensive and slower than desired for large-scale structural exploration. While recent efforts such as Proteina have introduced flow matching to improve sampling efficiency, the potential of tokenization for structural compression and acceleration remains largely unexplored in the protein domain. In this work, we present SaDiT, a novel framework that accelerates protein backbone generation by integrating SaProt tokenization with a Diffusion Transformer (DiT) architecture. SaDiT leverages a discrete latent space to represent protein geometry, significantly reducing the complexity of the generation process while maintaining theoretical $SE(3)$-equivariance. To further enhance efficiency, we introduce an IPA Token Cache mechanism that optimizes the Invariant Point Attention (IPA) layers by reusing computed token states during iterative sampling. Experimental results demonstrate that SaDiT outperforms state-of-the-art models, including RFDiffusion and Proteina, in both computational speed and structural viability. We evaluate our model on unconditional backbone generation and fold-class conditional generation tasks, where SaDiT shows a superior ability to capture complex topological features with high designability.

Paper Structure

This paper contains 27 sections, 9 equations, 8 figures, 6 tables, 2 algorithms.

Figures (8)

  • Figure 1: Comparison of Diffusion Trajectories in Coordinate vs. Latent Structural Manifolds. (Left) Coordinate-based diffusion (e.g., RFDiffusion) must navigate a high-dimensional, continuous energy landscape characterized by numerous non-physical local minima, leading to "jittered" trajectories and potential structural inconsistencies. (Right) SaDiT operates on the SaProt Discrete Latent Manifold, where the structural search space is regularized into a grid of pre-validated geometric tokens. This discrete bottleneck dampens coordinate noise and enables early topological convergence, as the model transitions from a high-entropy latent state to a stable, designable fold significantly faster than coordinate-space baselines.
  • Figure 2: Illustration of the proposed SaDiT framework for protein backbone generation. The pipeline consists of three integrated modules: (a) SaProt Tokenization: A geometry-invariant encoder maps raw protein coordinates ($SE(3)$) into a discrete manifold of structural tokens, regularizing the search space into pre-validated geometric states. (b) Diffusion Transformer: The generative backbone uses Invariant Point Attention (IPA) and Adaptive Layer Normalization (adaLN) to model long-range topological dependencies between noisy structural tokens. (c) IPA Token Cache: During reverse diffusion, as the structure crystallizes (low active token fraction $\rho_t$), the model reuses spatial affinities to skip redundant operations, enabling near-linear memory scaling.
  • Figure 3: Computational Efficiency and Scaling via IPA Token Caching. (a) Token Convergence Dynamics: Evolution of the active token fraction $\rho_t$ across the reverse diffusion process. For longer chains ($L=800$), the global topology crystallizes more decisively, allowing the model to bypass redundant spatial computations during the final 20% of sampling. (b) Peak Memory Scaling: Comparison of memory overhead between standard IPA and the SaDiT caching mechanism. By exploiting structural convergence, SaDiT shifts from quadratic $O(L^2)$ scaling toward a near-linear growth profile, achieving a 70% reduction in peak memory usage for large proteins.
  • Figure 4: IPA Token Cache Sensitivity and Temporal Utility. (a) Speed-Accuracy Pareto Trade-off: As the threshold $\epsilon$ increases, sampling time decreases significantly due to higher cache reuse, with only a marginal impact on structural precision (scRMSD). We select $\epsilon = 0.05$ as a balanced operating point that provides a 25% speedup while maintaining high structural fidelity. (b) Cache Utility vs. Sampling Stage: The efficacy of the caching mechanism is highly time-dependent. In the early stages of diffusion ($t/T < 0.7$), tokens undergo significant displacement as the global fold is established. In the final 30% of steps (Refinement Phase), the structure stabilizes, leading to a high cache hit rate and near-linear computational scaling for spatial attention.
  • Figure 5: Impact of Tokenization Granularity on Fidelity and Speed. (a) Structural Fidelity vs. Token Granularity: While $\alpha$-helical structures are relatively robust to downsampling, $\beta$-sheet heavy topologies exhibit a sharp, non-linear drop in designability as the downsampling factor $k$ increases. This is due to the loss of fine-grained geometric constraints required for precise $\beta$-strand alignment. (b) Inference Speedup: Increasing $k$ provides near-linear gains in sampling speed by reducing the effective sequence length. We select $k=1$ as our default to ensure maximum structural fidelity across all fold classes, utilizing tokenization primarily for its manifold regularization properties rather than aggressive sequence compression.
  • ...and 3 more figures
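
The captions for Figures 3 and 4 describe an active token fraction $\rho_t$ and a cache threshold $\epsilon$, but this excerpt does not give their exact definitions. The following is a minimal sketch of how such a cache could work, under stated assumptions: a token counts as "active" when its displacement between two consecutive sampling steps exceeds $\epsilon$, and only active tokens have their features recomputed. The function names (`active_mask`, `active_fraction`, `cached_update`), the displacement criterion, and the units of `eps` are hypothetical and are not taken from the paper.

```python
import numpy as np

def active_mask(prev_coords, curr_coords, eps=0.05):
    """Flag tokens that moved more than eps since the previous step.

    Assumption: displacement is the Euclidean distance between a token's
    positions at consecutive steps. The paper's criterion may differ.
    """
    displacement = np.linalg.norm(curr_coords - prev_coords, axis=-1)
    return displacement > eps

def active_fraction(mask):
    """Fraction of tokens that are still active (a stand-in for rho_t)."""
    return float(mask.mean())

def cached_update(cache, compute_fn, inputs, mask):
    """Recompute features for active tokens only; reuse the cache elsewhere."""
    out = cache.copy()
    if mask.any():
        out[mask] = compute_fn(inputs[mask])
    return out

# Toy example: 8 tokens, 2 of which moved between steps.
prev = np.zeros((8, 3))
curr = prev.copy()
curr[[1, 5]] += 0.1                      # displacement ~0.173 > eps

mask = active_mask(prev, curr, eps=0.05)
rho = active_fraction(mask)              # 2 of 8 tokens are active

cache = np.zeros((8, 4))                 # stale per-token features
fresh = cached_update(cache, lambda x: np.ones((len(x), 4)), curr, mask)
print(rho, fresh.sum())
```

In this toy setting the recomputation touches only the rows selected by `mask`, so the work per step shrinks as $\rho_t$ falls, which is consistent with the captions' description of higher cache reuse late in sampling. How the real mechanism interacts with pairwise IPA terms is not specified here and is left out of the sketch.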