HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising

Kai Zou, Dian Zheng, Hongbo Liu, Tiankai Hang, Bin Liu, Nenghai Yu

TL;DR

This paper proposes HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially, it performs causal generation across all blocks at every denoising step, so that each block is always conditioned on context at the same noise level.

Abstract

Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods typically condition on highly denoised contexts; yet this practice propagates prediction errors with high certainty, thereby exacerbating degradation. In this paper, we argue that a highly clean context is unnecessary. Drawing inspiration from bidirectional diffusion models, which denoise frames at a shared noise level while maintaining coherence, we propose that conditioning on context at the same noise level as the current block provides sufficient signal for temporal consistency while effectively mitigating error propagation. Building on this insight, we propose HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially, it performs causal generation across all blocks at every denoising step, so that each block is always conditioned on context at the same noise level. This hierarchy naturally admits pipelined parallel inference, yielding a 1.8× wall-clock speedup in our 4-step setting. We further observe that self-rollout distillation under this paradigm amplifies a low-motion shortcut inherent to the mode-seeking reverse-KL objective. To counteract this, we introduce a forward-KL regulariser in bidirectional-attention mode, which preserves motion diversity for causal inference without interfering with the distillation loss. On VBench (20s generation), HiAR achieves the best overall score and the lowest temporal drift among all compared methods.
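
The reversed generation order described in the abstract can be illustrated with a small scheduling sketch. It is not the authors' implementation: the function names are hypothetical, and the dependency rule is an assumption read off the abstract, namely that job (step s, block b) needs only (s-1, b) and (s, b-1). Under that assumption, jobs with the same s + b are independent and can run concurrently, which is one way to picture the pipelined inference.

```python
from typing import List, Tuple

# A job is (denoising step index, block index). All names here are
# hypothetical and chosen only for this illustration.
Job = Tuple[int, int]


def block_first_order(num_blocks: int, num_steps: int) -> List[Job]:
    """Conventional AR order: finish every denoising step of one block
    before starting the next block."""
    return [(s, b) for b in range(num_blocks) for s in range(num_steps)]


def hierarchical_order(num_blocks: int, num_steps: int) -> List[Job]:
    """Reversed order: at each denoising step, sweep causally over all blocks,
    so every block sees context at the same noise level."""
    return [(s, b) for s in range(num_steps) for b in range(num_blocks)]


def pipelined_wavefronts(num_blocks: int, num_steps: int) -> List[List[Job]]:
    """Group jobs by s + b. ASSUMPTION: job (s, b) depends only on (s-1, b)
    and (s, b-1), so all jobs within one group can run in parallel."""
    waves: List[List[Job]] = [[] for _ in range(num_blocks + num_steps - 1)]
    for s in range(num_steps):
        for b in range(num_blocks):
            waves[s + b].append((s, b))
    return waves


def ideal_speedup(num_blocks: int, num_steps: int) -> float:
    """Upper bound of this idealised model: sequential job count divided by
    the number of wavefronts. Ignores communication and uneven job cost."""
    return (num_blocks * num_steps) / (num_blocks + num_steps - 1)


if __name__ == "__main__":
    print(block_first_order(2, 2))
    print(hierarchical_order(2, 2))
    for t, wave in enumerate(pipelined_wavefronts(3, 4)):
        print(t, wave)
```

The bound returned by `ideal_speedup` belongs to this toy model only. It assumes equal-cost jobs and free communication, so it should not be read as a prediction of the measured wall-clock speedup reported in the abstract.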

Paper Structure (14 sections, 11 equations, 5 figures, 3 tables, 1 algorithm)

Figures (5)

  • Figure 1: Motivation. (a) Bidirectional diffusion (Wan2.1) proves that a shared noise level provides sufficient context for temporal coherence, though limited to a fixed horizon. (b) Standard AR (Self-Forcing) scales length but suffers quality drift, as conditioning on fully clean context amplifies error propagation. (c) Applying our hierarchical denoising (matched-noise context) only at inference (w/o training) mitigates drift but breaks continuity due to train-test mismatch; HiAR retrains under the hierarchical pipeline (w/ training), achieving scalable long-video generation with stable quality and seamless continuity.
  • Figure 2: Overview of HiAR. Left: Existing block-first AR (e.g., Self-Forcing) fully denoises each block before generating the next, conditioning every step on predicted clean context and thus amplifying inter-block error propagation. Right: Our hierarchical denoising performs causal generation across all blocks within each denoising step, conditioning on context at the matched noise level to suppress error accumulation. Bottom: Training combines causal self-rollout with a reverse-KL (DMD) loss for distillation, and a forward-KL regulariser computed in bidirectional-attention mode via teacher trajectory sampling to preserve motion diversity.
  • Figure 3: Qualitative comparison of distilled AR models at 20 s. We show temporally sampled frames from six diverse prompts covering natural scenery, objects, and human subjects. HiAR maintains consistent colour and detail throughout, while baselines exhibit progressive degradation.
  • Figure 4: Correlation between bidirectional and causal dynamics during training (w/o $\mathcal{L}_{\text{FKL}}$). Each point represents one training checkpoint; colour encodes the training step. A strong positive correlation (Pearson $r=0.968$) confirms that the low-motion shortcut affects both attention modes simultaneously and that regularising the bidirectional mode effectively constrains causal-mode dynamics.
  • Figure 5: Comparison of single-step denoising under bidirectional vs. causal attention. Bidirectional attention produces frames of uniform quality and blur across all positions, while causal attention yields progressively sharper frames as preceding context reduces uncertainty for later positions.
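
The captions above refer to a mode-seeking reverse-KL loss and a forward-KL regulariser. The following toy example, which is not taken from the paper, illustrates the general difference between the two divergences on a 1-D problem. The bimodal target, the two candidate Gaussians, and all function names are assumptions made for illustration. Reverse KL, KL(q‖p), favours the candidate that collapses onto a single mode, while forward KL, KL(p‖q), favours the candidate that covers both modes. This is only loosely analogous to the low-motion shortcut and its regulariser.

```python
import math

# Integration grid on [-10, 10]; simple Riemann sum with step DX.
DX = 0.01
GRID = [-10.0 + DX * i for i in range(2001)]


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def target_pdf(x: float) -> float:
    """Toy bimodal target p: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)."""
    return 0.5 * gaussian_pdf(x, -2.0, 0.5) + 0.5 * gaussian_pdf(x, 2.0, 0.5)


def kl(a_vals, b_vals) -> float:
    """Numerical KL(a || b) = integral of a * log(a / b) over the grid."""
    return sum(a * math.log(a / b) for a, b in zip(a_vals, b_vals) if a > 0.0) * DX


p_vals = [target_pdf(x) for x in GRID]

# Two hypothetical single-Gaussian candidates for q:
#   "narrow" sits on one mode; "wide" matches the mean and variance of p.
candidates = {
    "narrow": (2.0, 0.5),
    "wide": (0.0, math.sqrt(4.25)),
}

forward_kl = {}  # KL(p || q): penalises q for missing mass where p has it
reverse_kl = {}  # KL(q || p): penalises q for placing mass where p has little
for name, (mean, std) in candidates.items():
    q_vals = [gaussian_pdf(x, mean, std) for x in GRID]
    forward_kl[name] = kl(p_vals, q_vals)
    reverse_kl[name] = kl(q_vals, p_vals)

if __name__ == "__main__":
    for name in candidates:
        print(name, "forward:", round(forward_kl[name], 3),
              "reverse:", round(reverse_kl[name], 3))
```

In this toy setting the reverse-KL ranking prefers the single-mode candidate and the forward-KL ranking prefers the mass-covering one, which is the standard intuition behind pairing a mode-seeking distillation loss with a forward-KL term.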