One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin

Abstract

Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, which prevents principled latency-quality trade-offs, and they allocate computation uniformly across input spatial tokens, wasting compute on unimportant regions. We introduce the Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations, with earlier latents capturing global structure and later ones carrying information that refines details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers average gains of $35.3\%$ and $39.6\%$ in FID and FDD scores, respectively. Project page: https://snap-research.github.io/elit/
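
The Read, core, Write flow and the tail-latent dropping described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the attention here is single-head with no learned projections, there is no grouping of tokens, the residual add on the spatial stream is an assumption, and the names `cross_attend`, `elit_sketch`, and `identity` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head cross-attention with no learned projections (a simplification)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        outputs.append([sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)])
    return outputs


def elit_sketch(spatial, latents, num_latents, core_blocks):
    """Read -> core blocks on latents -> Write, using only the first num_latents latents."""
    kept = latents[:num_latents]           # tail latents are dropped
    kept = cross_attend(kept, spatial)     # Read: latents query the spatial tokens
    kept = core_blocks(kept)               # core blocks see only the latents
    update = cross_attend(spatial, kept)   # Write: spatial tokens query the latents
    # Residual add on the spatial stream is an assumption of this sketch.
    return [[s + u for s, u in zip(s_row, u_row)] for s_row, u_row in zip(spatial, update)]


def identity(tokens):
    """Stand-in for the unchanged DiT block stack."""
    return tokens


rng = random.Random(0)
spatial = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]  # 6 spatial tokens
latents = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(4)]  # up to 4 latents

# Training-time tail dropping: sample a budget, keep that many leading latents.
train_budget = rng.randint(1, len(latents))
train_out = elit_sketch(spatial, latents, train_budget, identity)

# Inference: the caller picks the budget explicitly.
low_budget_out = elit_sketch(spatial, latents, 1, identity)
```

Because the output always has one row per spatial token, the number of latents can change between calls without touching the input or output shapes, which is what makes it usable as a compute knob.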

Paper Structure (27 sections, 3 equations, 22 figures, 9 tables)

This paper contains 27 sections, 3 equations, 22 figures, 9 tables.

Figures (22)

  • Figure 1: Flexible compute allocation with ELIT. Starting from a vanilla DiT, we add a variable-length set of latent tokens—the latent interface—and two lightweight cross-attention layers, Read and Write. At inference, the number of latent tokens is a user-controlled knob that yields a smooth quality–FLOPs trade-off across DiT, U-ViT, HDiT, and MM-DiT backbones.
  • Figure 2: Adaptive computation. We test whether DiT and ELIT-DiT can reallocate compute across image regions by training on synthetic inputs formed by zero-padding real images, which artificially increases the token count (◆). We compare their performance to baselines trained on real data using patch size $2\!\times\!2$ and patch size $1\!\times\!1$ (★). Vanilla DiT does not improve: attention in zeroed regions targets other zeroed regions (see “DiT Attention”), so extra tokens raise cost without benefit. In contrast, ELIT-DiT uses the Read layer to pull informative spatial tokens into the latent interface (see “Read Attention”), effectively filtering out zeroed areas (see “ELIT-DiT Attention”). Consequently, it leverages the larger token budget and matches the real-data baseline at equal FLOPs.
  • Figure 3: Architecture of ELIT. We extend a DiT-like generator with a variable-length set of latent tokens (the latent interface) and lightweight Read/Write cross-attention layers. A short spatial DiT head processes patchified inputs; Read pulls information into the latent domain, where the core blocks operate. Write broadcasts updated latents back to spatial tokens, and a small spatial tail produces the output. Spatial tokens and latents are partitioned into corresponding groups, with cross-attention operating only within groups. During training, we randomly drop tail latents, yielding an importance-ordered interface. At inference, the number of latents serves as a user-controlled compute knob.
  • Figure 4: Training convergence. ELIT-DiT significantly accelerates convergence, achieving a $3.3\times$ speedup on ImageNet-1K 256px and a $4.0\times$ speedup on ImageNet-1K 512px.
  • Figure 5: Guidance strategies. ELIT enables autoguidance out of the box by providing a well-aligned weaker model that runs at $\approx 35\%$ of the cost for the unconditional path. When this weaker path is paired with classifier-free guidance (CFG), a combination denoted cheap CFG (CCFG), it reduces overall generation cost by $\approx 33\%$ while improving quality. Compared to DiT, ELIT-DiT improves the best FID by $19\%$.
  • ...and 17 more figures
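
The cost figures quoted for cheap CFG in Figure 5 can be sanity-checked with a short calculation. The sketch below assumes that standard CFG spends two full-cost forward passes per sampling step (one conditional, one unconditional) and that CCFG replaces only the unconditional pass with the weaker path at about 35% of the cost; this accounting is a reading of the caption, not something the source states outright. The function names are hypothetical, and `guided_prediction` is the usual CFG combination rule rather than anything specific to ELIT.

```python
def guided_prediction(cond, uncond, scale):
    """Standard CFG combination, elementwise: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def relative_cost_saving(weak_path_cost, full_path_cost=1.0):
    """Fraction of per-step cost saved when the unconditional pass uses the weak path."""
    baseline = 2.0 * full_path_cost             # assumed: two full passes for plain CFG
    cheap = full_path_cost + weak_path_cost     # full conditional pass + weak unconditional pass
    return (baseline - cheap) / baseline


saving = relative_cost_saving(0.35)             # about 0.325, in line with the quoted ~33%
combined = guided_prediction([1.0, 2.0], [0.0, 1.0], scale=2.0)
```

Under these assumptions the saving comes out near one third, which matches the caption's $\approx 33\%$ reduction.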