The Dawn of Video Generation: Preliminary Explorations with SORA-like Models

Ailing Zeng, Yuhang Yang, Weidong Chen, Wei Liu

TL;DR

Despite the emergence of DiT-based closed-source and open-source models, a comprehensive investigation into their capabilities and limitations remains lacking, and the rapid development has made it challenging for recent benchmarks to fully cover SORA-like models and recognize their significant advancements.

Abstract

High-quality video generation, encompassing text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) generation, holds considerable significance both for content creation, helping anyone express their inherent creativity in new ways, and for world simulation, supporting the modeling and understanding of the world. Models like SORA have advanced video generation, producing videos with higher resolution, more natural motion, better vision-language alignment, and increased controllability, particularly for long video sequences. These improvements have been driven by the evolution of model architectures, shifting from UNet to more scalable and parameter-rich DiT models, along with large-scale data expansion and refined training strategies. However, despite the emergence of DiT-based closed-source and open-source models, a comprehensive investigation into their capabilities and limitations remains lacking. Furthermore, rapid development has made it challenging for recent benchmarks to fully cover SORA-like models and recognize their significant advancements. Additionally, evaluation metrics often fail to align with human preferences.

Paper Structure (56 sections, 88 figures, 1 table)

This paper contains 56 sections, 88 figures, 1 table.

Figures (88)

  • Figure 1: The timeline of recent SORA-like models, including closed-source models (the upper) and open-source models (the lower). We summarize and introduce these models in this report.
  • Figure 2: Overview of Section 2. We compare with existing vertical-domain video models, including human-centric animation, robotics, cartoon animation, world models, autonomous driving, and camera controls in the video generation area.
  • Figure 3: Comparisons with pose-controllable image animation (e.g., Animate-Anyone [hu2024animate_anyone]). Prompt: (I2V-591) "The camera remains still, swinging the person's left and right hands back and forth. At the same time, the left and right feet move rhythmically." It is hard to generate continuous and complex actions solely through text control, and there are still limitations in ID preservation.
  • Figure 4: Comparisons with pose-controllable image animation (e.g., Animate-Anyone [hu2024animate_anyone]). Prompt: (I2V-593) "The camera stays still as the man walks to the camera from a distance." For simple motions such as walking, most models can generate plausible results, but some generate actions that do not follow the direction given in the instructions, e.g., QingYing and Kling1.5.
  • Figure 5: Comparisons with pose-controllable portrait animation (e.g., Follow-Your-Emoji [ma2024emoji]) in a photo-realistic style. Prompt: (I2V-598) "The boy makes an exaggerated expression on his face." The models generally generate content that aligns with the intended facial expressions, but it is difficult to maintain facial identity under large expressive movements.
  • ...and 83 more figures