Enhance-A-Video: Better Generated Video for Free

Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You

TL;DR

Enhance-A-Video presents a training-free, plug-in method to boost temporal coherence and visual fidelity in diffusion-transformer–based video generation by leveraging cross-frame information in temporal attention. It introduces Cross-Frame Intensity (CFI) derived from non-diagonal attention and an enhanced temperature mechanism via a dedicated Enhance Block, integrated in a residual path to modestly amplify cross-frame signals while preserving intra-frame details. The approach is model-agnostic and demonstrated across both 3D full-attention and spatial-temporal DiT-based models (e.g., HunyuanVideo, CogVideoX, LTX-Video, Open-Sora), yielding improved temporal consistency and visual quality with minimal inference overhead. Quantitative user studies and VBench evaluations corroborate the qualitative gains, and ablations highlight moderate temperature values and clipping as key to stable, high-quality enhancements. The work opens avenues for adaptive temperature control and joint attention enhancements, suggesting practical impact for real-time video generation and editing workflows.
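
The mechanism summarized above can be illustrated with a minimal sketch in plain Python. It assumes a single F x F temporal attention map given as nested lists; the function names, the lower clipping bound of 1, and the choice to multiply the attention output by the clipped value are illustrative assumptions rather than details confirmed by this summary.

```python
def cross_frame_intensity(attn):
    """Mean of the off-diagonal (cross-frame) entries of an F x F attention map.

    Requires F >= 2, otherwise there are no off-diagonal entries.
    """
    f = len(attn)
    off_diag = [attn[i][j] for i in range(f) for j in range(f) if i != j]
    return sum(off_diag) / len(off_diag)


def enhance_scale(attn, temperature, floor=1.0):
    """Temperature-scaled CFI, clipped from below.

    The floor of 1.0 is an assumption: it keeps the block from ever
    weakening the original temporal attention output.
    """
    return max(floor, temperature * cross_frame_intensity(attn))


def enhance_output(attn_out, attn, temperature):
    """Scale the temporal attention output by the clipped factor.

    The caller is assumed to add the result back through the model's
    usual residual connection.
    """
    scale = enhance_scale(attn, temperature)
    return [[scale * v for v in row] for row in attn_out]


# Usage: a uniform 4-frame attention map has CFI = 0.25.
uniform = [[0.25] * 4 for _ in range(4)]
hidden = [[1.0, -2.0], [0.5, 4.0]]             # toy attention output
boosted = enhance_output(hidden, uniform, 8.0)  # scale = 8 * 0.25 = 2.0
unchanged = enhance_output(hidden, uniform, 2.0)  # 0.5 is clipped up to 1.0
```

Under these assumptions, a small temperature leaves the output untouched because of the clipping, while a larger one amplifies it in proportion to how much attention already flows between frames.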

Abstract

DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.

Paper Structure

This paper contains 21 sections, 12 equations, 19 figures, 2 tables.

Figures (19)

  • Figure 1: Enhance-A-Video boosts diffusion transformer-based video generation quality at minimal cost: no training needed, no extra learnable parameters, no memory overhead. Detailed captions are available in the appendix.
  • Figure 2: Video sample of HunyuanVideo model with unnatural head movements, repeated right hands and conflicting glove color.
  • Figure 3: Visualization of temporal attention distributions in Open-Sora for blocks 2, 14, and 26 at denoising step 30, where non-diagonal elements are considerably weaker than diagonal elements.
  • Figure 4: Overview of the Enhance Block. The block computes the average of non-diagonal elements from the temporal attention map as Cross-Frame Intensity (CFI). The CFI is scaled by the temperature parameter and fused back to enhance the temporal attention output.
  • Figure 5: Temporal attention difference map between the original CogVideoX model and the same model with Enhance-A-Video, at layer 29 and denoising step 50. With Enhance-A-Video, non-diagonal elements of the attention matrix show higher values (shown in blue), while diagonal elements have reduced values (shown in red).
  • ...and 14 more figures
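
The Figure 4 caption can be written compactly as follows. The notation is an assumption introduced here for illustration: $A \in \mathbb{R}^{F \times F}$ denotes the temporal attention map over $F$ frames and $\tau$ the temperature parameter. The caption does not state exactly how the scaled value is fused into the attention output, so that step is left out.

```latex
\mathrm{CFI} \;=\; \frac{1}{F(F-1)} \sum_{i=1}^{F} \sum_{\substack{j=1 \\ j \neq i}}^{F} A_{ij},
\qquad
\mathrm{CFI}_{\text{scaled}} \;=\; \tau \cdot \mathrm{CFI}
```

Because each row of $A$ sums to 1 after the softmax, weak non-diagonal entries (as in Figure 3) give a CFI close to 0, and a map that spreads attention evenly across frames gives $\mathrm{CFI} = 1/F$.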