WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

Hainuo Wang, Mingjia Li, Xiaojie Guo

Abstract

While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold severely intertwines optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models. It effectively disentangles the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. They then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution towards the next state, ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT beats strong pixel-space baselines, accelerating JiT training convergence by 2.2x. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.
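The inference procedure described in the abstract (re-infer waypoints from the current noisy state, then take one conditioned step) can be sketched as a plain Euler sampler. This is a minimal illustration, not the authors' implementation: the function names, the time convention (noise at t = 0, data at t = 1), and the toy stand-in models are all assumptions made for this example, and the vectors stand in for image tensors.

```python
import random
from typing import Callable, List

Vector = List[float]


def sample_with_waypoints(
    waypoint_generator: Callable[[Vector, float], Vector],
    velocity_model: Callable[[Vector, float, Vector], Vector],
    dim: int,
    num_steps: int = 50,
    seed: int = 0,
) -> Vector:
    """Euler-integrate dz/dt = v(z_t, t, s_hat) from t = 0 to t = 1.

    Assumed convention: z_0 is Gaussian noise and z_1 is the sample.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from the prior
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        s_hat = waypoint_generator(z, t)   # waypoints re-inferred at every step
        v = velocity_model(z, t, s_hat)    # main model conditioned on waypoints
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Hypothetical stand-ins so the loop can run end to end.
TOY_TARGET: Vector = [1.0, -1.0, 0.5]


def toy_waypoint_generator(z: Vector, t: float) -> Vector:
    """Pretend the waypoint is a fixed target, ignoring the noisy state."""
    return list(TOY_TARGET)


def toy_velocity_model(z: Vector, t: float, s_hat: Vector) -> Vector:
    """Velocity of the straight-line path from z toward s_hat (valid for t < 1)."""
    return [(si - zi) / (1.0 - t) for zi, si in zip(z, s_hat)]


sample = sample_with_waypoints(
    toy_waypoint_generator, toy_velocity_model, dim=3, num_steps=20
)
```

In this toy setting the waypoint is itself the target, so the final state lands on `TOY_TARGET` up to floating-point error. In the setting the abstract describes, the waypoints would instead be semantic features that condition a learned pixel-space model.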

Paper Structure (22 sections, 18 equations, 13 figures, 4 tables, 2 algorithms)

Figures (13)

  • Figure 1: An overview of our Waypoint Diffusion Transformers. (a) and (b) demonstrate the difference in trajectories before/after the waypoint is introduced. In standard pixel-space FM (a), mapping directly to an entangled, non-discriminative pixel manifold (d) induces severe trajectory conflict. With the integration of discriminative semantic waypoints (c), our WiT successfully converts the noise-to-pixel task into two stable, decoupled mappings. By routing the transport path, the generative flow is disentangled, thus mitigating path overlap. Consequently, WiT significantly accelerates convergence compared to baseline (e) while yielding highly realistic generated samples (f).
  • Figure 2: Overview of the WiT architecture. Left: A lightweight Waypoints Generator (21M params) predicts Semantic Waypoints from the noisy state $z_t$. Right: The Pixel Space Generator synthesizes the image, utilizing these predicted waypoints as spatial conditions via the Just-Pixel AdaLN mechanism.
  • Figure 3: (a) Just-Pixel AdaLN: The predicted semantic waypoints provide spatially varying modulation. (b) Visualization of the predicted semantic waypoints and intermediate pixel states during inference. Left: The evolving noisy pixel states $z_t$ at different integration timesteps. Right: The corresponding spatial semantic waypoints $\hat{s}_0$ dynamically inferred by our lightweight Waypoints Generator.
  • Figure 4: Qualitative results of WiT-L/16 on ImageNet $256 \times 256$ [deng2009imagenet].
  • Figure 5: The impact of CFG on FID and IS. The gold star indicates the minimum FID.
  • ...and 8 more figures
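
The "spatially varying modulation" mentioned in the Figure 3 caption can be illustrated with a small sketch. The details of Just-Pixel AdaLN are not given in this excerpt, so the code below assumes the common adaptive layer-norm form, `(1 + scale) * LayerNorm(x) + shift`, and only changes one thing: each token receives its own scale and shift instead of sharing a single global pair. The function names are hypothetical, and the scale and shift values are passed in directly rather than produced by a learned projection of the waypoints.

```python
import math
from typing import List

Tokens = List[List[float]]  # shape: [num_tokens][channels]


def layer_norm(token: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize one token over its channel dimension (no learned affine)."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]


def spatial_adaln(tokens: Tokens, scale: Tokens, shift: Tokens) -> Tokens:
    """Apply a per-token (spatially varying) scale and shift after layer norm.

    Assumption: in a full model, scale and shift would be predicted from the
    per-token semantic waypoints; here they are plain inputs.
    """
    if not (len(tokens) == len(scale) == len(shift)):
        raise ValueError("tokens, scale, and shift need one row per token")
    out: Tokens = []
    for x, g, b in zip(tokens, scale, shift):
        x_norm = layer_norm(x)
        out.append([(1.0 + gi) * xi + bi for xi, gi, bi in zip(x_norm, g, b)])
    return out


tokens = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
zero_scale = [[0.0] * 3, [0.0] * 3]
per_token_shift = [[0.5] * 3, [-1.0] * 3]  # a different shift for each token
modulated = spatial_adaln(tokens, zero_scale, per_token_shift)
```

With a global condition, both rows of `per_token_shift` would be identical. Letting them differ is what makes the conditioning spatial: after modulation, each token's channel mean equals its own shift.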