From Sora What We Can See: A Survey of Text-to-Video Generation

Rui Sun, Yumin Zhang, Tejal Shah, Jiahao Sun, Shuoying Zhang, Wenqi Li, Haoran Duan, Bo Wei, Rajiv Ranjan

TL;DR

The survey dissects text-to-video generation through the lens of Sora, outlining foundational models (GANs, VAEs, diffusion, autoregressive, and transformers), and organizing literature into evolutionary generators, pursuit of extended duration, high resolution, and seamless quality, plus a realistic panorama of motion, scenes, and layouts. It details datasets and evaluation metrics, identifies key challenges (motion coherence, multi-object interactions, data privacy, and long-range consistency), and articulates future directions including robot learning from visual guidance, infinite 3D scene reconstruction, augmented digital twins, and normative AI frameworks. By connecting methodological advances with practical concerns, the work provides a comprehensive, technically grounded roadmap for advancing T2V research and its applications. The synthesis emphasizes diffusion-transformer architectures and cross-modal conditioning as central drivers, while calling for robust evaluation standards and responsible deployment guidelines.

Abstract

With the impressive achievements made so far, artificial intelligence is advancing toward artificial general intelligence. Sora, developed by OpenAI and capable of minute-level world-simulative abilities, can be considered a milestone on this developmental path. However, despite its notable successes, Sora still faces various obstacles that need to be resolved. In this survey, we take the perspective of disassembling Sora in text-to-video generation and conduct a comprehensive review of the literature, trying to answer the question "From Sora What We Can See". Specifically, after introducing basic preliminaries on the general algorithms, we categorize the literature along three mutually orthogonal dimensions: evolutionary generators, excellent pursuit, and realistic panorama. Subsequently, we organize the widely used datasets and metrics in detail. Finally, and more importantly, we identify several challenges and open problems in this domain and propose potential future directions for research and development.

Paper Structure

This paper contains 37 sections, 22 equations, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Text-to-video (T2V) generation is a flourishing research area that has gone through several iterations in recent years. Early works were limited to simple scenes (low-resolution, single-object, and short-duration). Subsequently, benefiting from the success of diffusion models in generative tasks, current works generate more complex videos, and various tools have become commercially successful. Sora, with its capacity to process longer prompts and its minute-level world-simulative video generation, is an extremely promising T2V tool, but it also faces challenges and open problems.
  • Figure 2: Illustrations of different generators.
  • Figure 3: The structure of the section "From Sora What We Can See".
  • Figure 4: Evolutionary timeline of T2V generators, based on foundational algorithms.
  • Figure 5: Screenshots of Sora-generated videos with their prompts, from sora_web.