The Dawn of Video Generation: Preliminary Explorations with SORA-like Models

Ailing Zeng; Yuhang Yang; Weidong Chen; Wei Liu

TL;DR

Despite the emergence of DiT-based closed-source and open-source models, a comprehensive investigation into their capabilities and limitations remains lacking, and the rapid development has made it challenging for recent benchmarks to fully cover SORA-like models and recognize their significant advancements.

Abstract

High-quality video generation, encompassing text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) generation, is significant both for content creation, where it helps anyone express their inherent creativity in new ways, and for world simulation, where it supports modeling and understanding the world. Models like SORA have advanced video generation with higher resolution, more natural motion, better vision-language alignment, and increased controllability, particularly for long video sequences. These improvements have been driven by the evolution of model architectures, shifting from UNet to more scalable and parameter-rich DiT models, along with large-scale data expansion and refined training strategies. However, despite the emergence of DiT-based closed-source and open-source models, a comprehensive investigation into their capabilities and limitations remains lacking. Furthermore, rapid development has made it challenging for recent benchmarks to fully cover SORA-like models and recognize their significant advancements. Additionally, evaluation metrics often fail to align with human preferences.

Paper Structure (56 sections, 88 figures, 1 table)

This paper contains 56 sections, 88 figures, 1 table.

Introduction
Task Definition and Input Modalities
Text-to-Video Generation.
Image-to-Video Generation.
Video-to-Video Generation.
SORA-like Model Objectives
Closed-source Models
SORA (OpenAI) opensora
Kling (Kuaishou) kuaishou2024kling
Dream Machine (LumaLabs) luma2024dm
Gen-3 Alpha (Runway) runway2024gen3
Vidu bao2024vidu (Shengshu) shengshu2024vidu
Qingying (Zhipu) zhipu2024qingying
Hailuo (MiniMax) minimax2024hailuo
Wanxiang (Ali Tongyi) tongyi2024wanxiang
...and 41 more sections

Figures (88)

Figure 1: The timeline of recent SORA-like models, including closed-source models (the upper) and open-source models (the lower). We summarize and introduce these models in this report.
Figure 2: Overview of Section 2. We compare with existing vertical-domain video models, including human-centric animation, robotics, cartoon animation, world models, autonomous driving, and camera controls in the video generation area.
Figure 3: Comparisons with the pose-controllable image animation (e.g., Animate-Anyone hu2024animate_anyone). Prompt: (I2V-591) "The camera remains still, swinging the person's left and right hands back and forth. At the same time, the left and right feet move rhythmically." It is hard to generate continuous and complex actions through text control alone, and ID preservation remains limited.
Figure 4: Comparisons with the pose-controllable image animation (e.g., Animate-Anyone hu2024animate_anyone). Prompt: (I2V-593) "The camera stays still as the man walks to the camera from a distance." For simple motions such as walking, most models generate plausible results, but some, e.g., QingYing and Kling1.5, generate actions that do not follow the direction given in the instructions.
Figure 5: Comparisons with the pose-controllable portrait animation (e.g., Follow-your EMOJI ma2024emoji) in a photo-realistic style. Prompt: (I2V-598) "The boy makes an exaggerated expression on his face." The models have generally generated content that aligns with the intended facial expressions, but it is difficult to maintain facial identity under large expressive movements.
...and 83 more figures
