A Survey: Spatiotemporal Consistency in Video Generation

Zhiyu Yin, Kehai Chen, Xuefeng Bai, Ruili Jiang, Juntao Li, Hongdong Li, Jin Liu, Yang Xiang, Jun Yu, Min Zhang

TL;DR

The paper addresses the challenge of spatiotemporal consistency in video generation and organizes its review around five aspects (foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics) to explain how modern methods sustain coherence across frames. It surveys a spectrum of approaches, from GANs and autoregressive models to diffusion and mask models, and from 3D convolutions and patch-based representations to hierarchical and latent-space generation schemes, highlighting how each contributes to temporal stability. It discusses post-processing tools such as frame interpolation, super-resolution, stabilization, deblurring, stylization, and relighting, and surveys evaluation metrics spanning objective fidelity, temporal smoothness, and subjective quality. The work emphasizes future directions such as long video generation, personalization, emotion expression, and temporally aware evaluation frameworks, aiming to guide both research and practical development in video AIGC.

Abstract

Video generation, by leveraging a dynamic visual generation method, pushes the boundaries of Artificial Intelligence Generated Content (AIGC). It presents unique challenges beyond static image generation, requiring both high-quality individual frames and temporal coherence to maintain consistency across the spatiotemporal sequence. Recent works have aimed to address the spatiotemporal consistency issue in video generation, but few literature reviews have been organized from this perspective. This gap hinders a deeper understanding of the underlying mechanisms for high-quality video generation. In this survey, we systematically review recent advances in video generation, covering five key aspects: foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics. We particularly focus on their contributions to maintaining spatiotemporal consistency. Finally, we discuss future directions and challenges in this field, hoping to inspire further efforts to advance the development of video generation.

Paper Structure

This paper contains 45 sections, 7 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Diagram of video generation schemes.