Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs

Bao Tang, Shuai Zhang, Yueting Zhu, Jijun Xiang, Xin Yang, Li Yu, Wenyu Liu, Xinggang Wang

TL;DR

Diffusion models are expensive to train and to sample from, which motivates a trajectory-based, image-free distillation method. The paper proposes TBCM, which distills entirely in latent space: it samples points along the teacher's inference trajectory and uses multiple trajectory points per prompt, so no VAE-encoded training images are needed. This reduces GPU memory and training time while maintaining or improving one-step generation quality. On MJHQ-30k, TBCM reaches 6.52 FID and 28.08 CLIP score under one-step generation, with roughly 40% less training time than Sana-Sprint. The work also analyzes the equivalent-noise discrepancy between forward diffusion and backward generation, and how sampling schemes affect consistency distillation.

Abstract

Timestep distillation is an effective approach for improving the generation efficiency of diffusion models. The Consistency Model (CM), as a trajectory-based framework, demonstrates significant potential due to its strong theoretical foundation and high-quality few-step generation. Nevertheless, current continuous-time consistency distillation methods still rely heavily on training data and computational resources, hindering their deployment in resource-constrained scenarios and limiting their scalability to diverse domains. To address this issue, we propose Trajectory-Backward Consistency Model (TBCM), which eliminates the dependence on external training data by extracting latent representations directly from the teacher model's generation trajectory. Unlike conventional methods that require VAE encoding and large-scale datasets, our self-contained distillation paradigm significantly improves both efficiency and simplicity. Moreover, the trajectory-extracted samples naturally bridge the distribution gap between training and inference, thereby enabling more effective knowledge transfer. Empirically, TBCM achieves 6.52 FID and 28.08 CLIP scores on MJHQ-30k under one-step generation, while reducing training time by approximately 40% compared to Sana-Sprint and saving a substantial amount of GPU memory, demonstrating superior efficiency without sacrificing quality. We further reveal the diffusion-generation space discrepancy in continuous-time consistency distillation and analyze how sampling strategies affect distillation performance, offering insights for future distillation research. GitHub Link: https://github.com/hustvl/TBCM.
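
The idea of collecting training latents from the teacher's own generation trajectory can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_teacher_velocity` and `sample_trajectory_points` are hypothetical names, the teacher is a toy closed-form velocity field rather than a diffusion network, and a rectified-flow-style linear path with plain Euler steps is assumed (the paper's parameterization and solver may differ). The sketch only shows the data flow: start from noise, integrate backward, and keep several intermediate latents for one prompt, with no images or VAE encoder involved.

```python
import random

def toy_teacher_velocity(x, t, target):
    # Hypothetical stand-in for a teacher network. Assumes the linear path
    # x_t = (1 - t) * x_0 + t * eps, whose velocity is (x_t - x_0) / t.
    # Here `target` plays the role of the prompt-conditioned clean latent.
    return [(xi - ti) / t for xi, ti in zip(x, target)]

def sample_trajectory_points(velocity_fn, target, num_steps, num_keep, rng):
    # Start from pure Gaussian noise at t = 1; no dataset is used.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]
    # Pick which solver steps to record, so one prompt yields several samples.
    keep = set(rng.sample(range(num_steps), num_keep))
    points = []
    for i in range(num_steps):
        t, t_next = times[i], times[i + 1]
        if i in keep:
            points.append((t, list(x)))  # (timestep, latent on the trajectory)
        v = velocity_fn(x, t, target)
        # Euler step backward in time, from t toward t_next < t.
        x = [xi + (t_next - t) * vi for xi, vi in zip(x, v)]
    return points, x

rng = random.Random(0)
target = [1.0, -2.0, 0.5]
points, final = sample_trajectory_points(toy_teacher_velocity, target, 20, 5, rng)
```

In a distillation loop, each recorded `(t, latent)` pair would be fed to the student and teacher to compute a consistency loss; that loss is omitted here because its exact form is not given in this section.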

Paper Structure

This paper contains 18 sections, 9 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Comprehensive Comparison. Left: GPU memory usage versus batch size during training, where Batch Size denotes the number of samples actually involved in optimization. Middle: Comparison of FID scores and throughput across different methods; the marker size indicates the model parameter count. Right: GPU memory consumption and total training time under identical training configurations.
  • Figure 2: One Step Generation Results. High-resolution (1024×1024) images generated by our one-step generator distilled from the Sana 0.6B model using the proposed TBCM. More results with different sampling steps are provided in the Appendix.
  • Figure 3: Discrepancy of Equivalent Noise Between Forward and Backward Processes. The equivalent noise (see the paper's equivalent-noise equation) remains constant in forward diffusion, but evolves noticeably in backward generation, reflecting the training–inference inconsistency.
  • Figure 4: Resource Bottlenecks in Continuous-Time Consistency Distillation. Top: Memory usage breakdown during distillation. Bottom: Training time breakdown during distillation.
  • Figure 5: Distillation Paradigm of TBCM. Left: Distillation begins with random noise and text prompt inputs. Middle: Multiple samples are generated for a single prompt within the latent space. Right: The collected samples are used to compute the consistency loss.
  • ...and 7 more figures