
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

Xingtong Ge, Yi Zhang, Yushi Huang, Dailan He, Xiahong Wang, Bingqi Ma, Guanglu Song, Yu Liu, Jun Zhang

Abstract

Distilling video generation models to extremely low inference budgets (e.g., 2--4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this challenge, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality-parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan~2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed \textbf{Salt}, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at \href{https://github.com/XingtongGe/Salt}{https://github.com/XingtongGe/Salt}.



Figures (7)

  • Figure 1: Compositionality deficit of DMD. First, middle, and last frames from 4-/8-/16-step DMD students (rows, top to bottom) on: (a) "...a spaceman wearing a red wool knitted motorcycle helmet..." and (b) "...a large stack of vintage televisions all showing different programs...museum gallery." Increasing the number of denoising steps degrades rather than improves quality: the 16-step model loses the knitted helmet texture and corrupts the motorcycle structure, and produces incoherent television details.
  • Figure 1: Displacement-normalized local semigroup defect on the test-time 4-step inference path. For each adjacent inference interval $(t_s,t_e)$, we compare the direct endpoint $x_{t_e}^{(1)}=\Psi_{\theta}^{t_s \rightarrow t_e}(x_{t_s})$ against the composed endpoint $x_{t_e}^{(2)}=\Psi_{\theta}^{t_m \rightarrow t_e}(\Psi_{\theta}^{t_s \rightarrow t_m}(x_{t_s}))$, where $t_m$ is the corresponding intermediate timestep from the finer training grid. Lower is better. SC-DMD achieves a lower overall local semigroup defect than the DMD baseline, supporting the claim that self-consistency regularization improves the compositional behavior of the learned denoising operator. (A sketch of this defect computation follows the figure list.)
  • Figure 2: Comparison of training trajectories for few-step distillation methods. Consistency distillation, shortcut models, and flow-map distillation all impose composition-related constraints in different forms. SC-DMD addresses the compositionality deficit by introducing a semigroup-defect regularizer that aligns direct and composed updates while preserving DMD as the distribution matching objective.
  • Figure 2: More qualitative comparisons between the Causal Forcing [zhu2026causal] baseline and our method. Our method shows consistent advantages in both visual quality and semantic consistency. Compared with the baseline, our results better preserve subject identity, object geometry, and scene composition across frames, while also producing smoother motion progression. The reading-girl example highlights reduced semantic/identity drift; the trombone and grape examples show improved structural stability and visual fidelity; and the umbrella example demonstrates more coherent subject interaction and temporal evolution.
  • Figure 3: Overview of Salt for autoregressive video generation. Left: A step count $K\!\in\!\{8,4,2\}$ is sampled to define the few-step denoising trajectory. Middle: Conditioned on the current KV cache, text, and noise, the student generator $G_\theta$ denoises the current chunk. A self-consistency (SC) loss $\mathcal{L}_{\mathrm{SC}}$ regularizes the endpoint discrepancy between a direct update and a composed two-step update. Right: During mixed-step autoregressive training, the predicted clean sample is re-noised at a random timestep and optimized with the DMD loss $\mathcal{L}_{\mathrm{DMD}}$, while a higher-quality reference branch provides the cache-conditioned alignment loss $\mathcal{L}_{\mathrm{align}}$. (A schematic sketch of how these losses combine in one training step follows the figure list.)
  • ...and 2 more figures
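
The displacement-normalized local semigroup defect measured in Figure 1 compares the direct endpoint of one denoising update against the composed endpoint of two consecutive updates through an intermediate timestep. Below is a minimal PyTorch-style sketch of that measurement under stated assumptions: `psi(x, t_from, t_to)` is a hypothetical interface to the student's denoising update $\Psi_{\theta}^{t_{\mathrm{from}} \rightarrow t_{\mathrm{to}}}$, and the shapes and normalization choice are illustrative, not the released implementation.

```python
import torch

def semigroup_defect(psi, x_ts, t_s, t_m, t_e, eps=1e-8):
    """Displacement-normalized local semigroup defect for one inference interval.

    psi(x, t_from, t_to) is assumed to apply the student update
    Psi_theta^{t_from -> t_to} to a batch of latents x (hypothetical interface).
    """
    # Direct endpoint: a single update over the whole interval (t_s -> t_e).
    x_direct = psi(x_ts, t_s, t_e)

    # Composed endpoint: two updates routed through the intermediate timestep t_m.
    x_composed = psi(psi(x_ts, t_s, t_m), t_m, t_e)

    # Per-sample defect between the two endpoints, normalized by the
    # displacement of the direct update so scale differences cancel out.
    defect = (x_direct - x_composed).flatten(1).norm(dim=1)
    displacement = (x_direct - x_ts).flatten(1).norm(dim=1)
    return (defect / (displacement + eps)).mean()
```

Averaging this quantity over adjacent intervals of the 4-step inference path yields a single scalar per model; a lower value indicates that direct and composed updates agree more closely, which is what the SC-DMD regularizer targets.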
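Figure 3 describes one training step that combines the DMD loss, the self-consistency loss, and the cache-conditioned alignment loss over a mixed-step autoregressive rollout. The sketch below shows one plausible way these pieces could fit together; the generator/critic interfaces, the `kv_cache_lq`/`kv_cache_hq` conditioning, the loss callables, and the weights are assumptions for illustration rather than the paper's implementation.

```python
import random
import torch

def salt_training_step(generator, dmd_loss, sc_loss, align_loss,
                       noise, text_emb, kv_cache_lq, kv_cache_hq,
                       lambda_sc=1.0, lambda_align=1.0):
    """One schematic Salt update for a single autoregressive chunk (illustrative only)."""
    # Sample the step count defining the few-step denoising trajectory.
    K = random.choice([8, 4, 2])

    # Student denoises the chunk conditioned on the (possibly low-quality) KV cache.
    x0_pred = generator(noise, text_emb, kv_cache_lq, num_steps=K)

    # DMD branch: re-noise the predicted clean sample at a random timestep
    # and apply the distribution matching loss.
    t = torch.randint(1, 1000, (x0_pred.shape[0],), device=x0_pred.device)
    loss_dmd = dmd_loss(x0_pred, t, text_emb)

    # SC branch: penalize the endpoint discrepancy between a direct update
    # and a composed two-step update (see the semigroup-defect sketch above).
    loss_sc = sc_loss(generator, noise, text_emb, kv_cache_lq)

    # Alignment branch: steer features of the low-quality-cache output toward
    # a higher-quality reference produced with a high-quality cache.
    with torch.no_grad():
        x0_ref = generator(noise, text_emb, kv_cache_hq, num_steps=K)
    loss_align = align_loss(x0_pred, x0_ref)

    return loss_dmd + lambda_sc * loss_sc + lambda_align * loss_align
```

The reference branch is kept out of the gradient path here, reflecting the caption's description of it as a higher-quality target that the cache-conditioned alignment loss pulls the student's output toward.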