Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers

Wenhao Sun, Ji Li, Zhaoqiang Liu

TL;DR

Just-in-Time (JiT) is a training-free framework that accelerates Diffusion Transformer sampling in the spatial domain. It formulates a spatially approximated generative ordinary differential equation (ODE) that drives the evolution of the full latent state using computations from a dynamically selected, sparse subset of anchor tokens.

Abstract

Diffusion Transformers have established a new state of the art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. Existing acceleration methods often focus on the temporal domain and overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. Treating all spatial regions with uniform computation is therefore a critical inefficiency. In this paper, we introduce Just-in-Time (JiT), a novel training-free framework that addresses this challenge through acceleration in the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the full latent state evolution based on computations from a dynamically selected, sparse subset of anchor tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a 7x speedup with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a superior trade-off between inference speed and generation fidelity.
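
The core idea of driving the full latent state from a sparse set of anchor tokens can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names (`make_selector`, `nearest_anchor_lift`, `sparse_euler_step`), the nearest-anchor extrapolation rule, the plain Euler update, and the toy velocity function are all assumptions made for illustration. The abstract does not specify how the anchor velocities are extrapolated to the remaining tokens.

```python
import numpy as np

def make_selector(n_tokens, anchor_idx):
    """Binary selector S of shape (K, N): S @ x picks the anchor rows of x."""
    k = len(anchor_idx)
    S = np.zeros((k, n_tokens))
    S[np.arange(k), anchor_idx] = 1.0
    return S

def nearest_anchor_lift(height, width, anchor_idx):
    """(N, K) matrix copying each token's value from its nearest anchor.

    Hypothetical extrapolation rule chosen for simplicity; the paper's
    actual operator may differ.
    """
    n = height * width
    ys, xs = np.divmod(np.arange(n), width)
    coords = np.stack([ys, xs], axis=1).astype(float)      # (N, 2) grid positions
    anchors = coords[anchor_idx]                           # (K, 2)
    d2 = ((coords[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # (N, K)
    P = np.zeros((n, len(anchor_idx)))
    P[np.arange(n), d2.argmin(axis=1)] = 1.0
    return P

def sparse_euler_step(x, t, dt, velocity_fn, S, P):
    """One Euler step of the full state using velocities from anchors only."""
    v_anchor = velocity_fn(S @ x, t)   # the expensive model sees K tokens, not N
    return x + dt * (P @ v_anchor)     # every token moves with the lifted velocity

# Toy usage: a 4x4 token grid with 4 anchors and a stand-in velocity field.
H, W, C = 4, 4, 3
anchor_idx = np.array([0, 2, 8, 10])
S = make_selector(H * W, anchor_idx)
P = nearest_anchor_lift(H, W, anchor_idx)
x = np.random.default_rng(0).normal(size=(H * W, C))
x_next = sparse_euler_step(x, 1.0, -0.1, lambda z, t: -z, S, P)
```

In this sketch the anchor tokens follow exactly the update they would receive from a full evaluation of the toy velocity field, while non-anchor tokens borrow the velocity of their nearest anchor. When every token is an anchor, both matrices reduce to the identity and the step becomes an ordinary Euler step.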

Paper Structure

This paper contains 40 sections, 17 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Visual showcases of our JiT framework applied to the FLUX.1-dev model. Our method produces high-fidelity and visually compelling images even at significant acceleration factors of $4\times$ and $7\times$.
  • Figure 2: An overview of our JiT framework, illustrating its core mechanisms and underlying philosophy. (a) The SAG-ODE evolves the latent state by extrapolating a velocity field computed on a sparse subset of tokens. (b) For stage transitions, the DMF evolves newly incorporated tokens to a structurally coherent target with the correct noise level to prevent artifacts. (c) The visualized evolution of the predicted clean image reveals a coarse-to-fine process (global structures first), motivating our strategy to defer computation on detailed regions. (d) The sampling trajectory visualizes our dynamic resource allocation, where the set of active tokens (red flow) starts as a narrow subset and expands over time, reserving full computation for the final detail-refining stages.
  • Figure 3: Qualitative comparison demonstrating the superior performance of our JiT framework. While competing methods suffer from common acceleration artifacts, including semantic errors, loss of fine detail, and structural inconsistencies, our approach maintains high fidelity across different prompts and acceleration levels.
  • Figure 4: Visual ablation study of each component within JiT.
  • Figure 5: An illustration of the construction process for the initial selector matrix $\mathbf{S}_K$.
  • ...and 5 more figures