RepVideo: Rethinking Cross-Layer Representation for Video Generation

Chenyang Si, Weichen Fan, Zhengyao Lv, Ziqi Huang, Yu Qiao, Ziwei Liu

TL;DR

RepVideo tackles the instability of cross-layer representations in transformer-based text-to-video diffusion models, which harms temporal coherence and spatial detail. It introduces a lightweight, cross-layer enhancement consisting of a Feature Cache Module and a gating mechanism to aggregate and fuse multi-layer features into enriched inputs for each transformer layer. Across automated benchmarks and human studies, RepVideo outperforms strong baselines in motion smoothness, spatial relationships, and overall video quality, demonstrating that richer, stabilized representations can significantly improve video generation without large architectural changes. The approach provides a practical path toward more coherent, detailed, and semantically aligned video synthesis in diffusion-based frameworks.

Abstract

Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we first investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, thereby improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.
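The idea of accumulating features from neighboring layers and feeding the enriched result into the next layer can be illustrated with a minimal sketch. This is not the authors' implementation: the names `FeatureCache`, `gated_fuse`, and `run_layers` are hypothetical, the element-wise mean used for aggregation and the scalar convex-combination gate are assumptions made for illustration, and plain Python lists stand in for feature tensors.

```python
from collections import deque


class FeatureCache:
    """Hypothetical cache holding the outputs of the most recent layers."""

    def __init__(self, window):
        # A bounded deque drops the oldest layer output automatically.
        self.buffer = deque(maxlen=window)

    def push(self, features):
        self.buffer.append(list(features))

    def aggregate(self):
        # Assumed aggregation: element-wise mean over the cached layers.
        count = len(self.buffer)
        return [sum(values) / count for values in zip(*self.buffer)]


def gated_fuse(layer_input, aggregated, gate):
    # Assumed gate: a scalar in [0, 1] blending the original layer input
    # with the aggregated multi-layer features.
    return [(1.0 - gate) * x + gate * a
            for x, a in zip(layer_input, aggregated)]


def run_layers(x, layers, window=2, gate=0.5):
    """Apply `layers` in order, enriching each input with cached features."""
    cache = FeatureCache(window)
    for layer in layers:
        if cache.buffer:
            # Fuse the current input with features from neighboring layers.
            x = gated_fuse(x, cache.aggregate(), gate)
        x = layer(x)
        cache.push(x)
    return x


# Toy "layers" standing in for transformer blocks.
toy_layers = [
    lambda v: [a + 1 for a in v],
    lambda v: [a * 2 for a in v],
    lambda v: [a + 1 for a in v],
]
result = run_layers([0.0, 2.0], toy_layers, window=2, gate=0.5)
```

In a real model the gate would more plausibly be learned and the features would be tensors of shape (tokens, channels); the fixed scalar here only makes the data flow easy to follow.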

Paper Structure (17 sections, 8 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Examples generated by RepVideo. RepVideo effectively generates diverse videos with enhanced temporal coherence and fine-grained spatial details.
  • Figure 2: The architecture of recent transformer-based video diffusion models. These methods typically consist of three core components: a 3D VAE, a text encoder, and a transformer network.
  • Figure 3: The visualization of the attention distribution of each frame’s token across the entire token sequence. The results highlight significant variations in attention distributions across layers, with deeper layers focusing more on tokens from the same frame and exhibiting weaker attention to tokens from other frames.
  • Figure 4: The visualization of attention maps across transformer layers. Each layer attends to distinct regions, capturing diverse spatial features. However, the lack of coordination across layers results in fragmented feature representations, weakening the model’s ability to establish coherent spatial semantics within individual frames.
  • Figure 5: The average similarity between adjacent frame features across diffusion layers and denoising steps. The similarity decreases as layer depth increases for a given denoising step, indicating greater differentiation in deeper layers. Additionally, similarity between adjacent frames declines as the denoising process progresses.
  • ...and 7 more figures
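
The adjacent-frame similarity described in the Figure 5 caption can be sketched as follows. The caption does not say which similarity measure is used, so cosine similarity is an assumption here, and the function names are hypothetical. Each frame is represented by a single flat feature vector for simplicity.

```python
import math


def cosine_similarity(u, v):
    # Cosine of the angle between two feature vectors; 0.0 if either is zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_adjacent_frame_similarity(frames):
    """Average similarity over consecutive frame pairs (frame i, frame i+1)."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    sims = [cosine_similarity(frames[i], frames[i + 1])
            for i in range(len(frames) - 1)]
    return sum(sims) / len(sims)


# Toy features: the first two frames match, the third is orthogonal.
toy_frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score = mean_adjacent_frame_similarity(toy_frames)
```

Computing this score per layer and per denoising step would give a grid like the one the caption describes, where lower values indicate that adjacent frames have drifted further apart in feature space.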