
WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs

Deshun Yang, Luhui Hu, Yu Tian, Zihao Li, Chris Kelly, Bang Yang, Cindy Yang, Yuexian Zou

TL;DR

WorldGPT is a video generation AI agent that uses Sora-inspired multimodal learning to build a world-model framework from textual prompts and accompanying images. The authors report that it is effective and novel compared with other methods in constructing world models from text and image inputs.

Abstract

Several text-to-video diffusion models have demonstrated commendable capabilities in synthesizing high-quality video content. However, maintaining temporal consistency and ensuring action smoothness throughout the generated sequences remains a formidable challenge. In this paper, we present a video generation AI agent that uses Sora-inspired multimodal learning to build a world-model framework from textual prompts and accompanying images. The framework has two parts: a prompt enhancer and full video translation. The first part uses ChatGPT to distill the input and construct precise prompts for each subsequent step, improving the accuracy of prompt communication and of execution in the subsequent model operations. The second part is compatible with existing advanced diffusion techniques and uses them to generate and refine the key frame at the end of a video. We then use the leading and trailing key frames to produce videos with enhanced temporal consistency and action smoothness. The experimental results confirm that our method is effective and novel compared with other methods in constructing world models from text and image inputs.

Paper Structure

This paper contains 15 sections, 5 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The framework follows a concise workflow. Users input image-text data, which our LLM-based Prompt Enhancer transforms into three specialized sub-prompts: keywords, input frame state, and optimization prompt. These sub-prompts, together with the original image, serve as multimodal inputs to the Key Frame Generator. Using a suite of state-of-the-art (SOTA) models, the generator produces a key frame, which is paired with the initial image and passed to the Video Generator. Guided by the optimization prompt, the Video Generator synthesizes the final video.
  • Figure 2: For our qualitative analysis, we feed image-text pairs to both our algorithm and the DynamiCrafter algorithm. The figure shows two sets of image-text pairs on the left, each consisting of an image and three prompts. For every prompt, both algorithms generate a corresponding video. The figure also displays four frames, placed in the top-left, top-right, bottom-left, and bottom-right corners, to illustrate the sequence of frames in the generated videos.
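
The data flow described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative sketch of how the three stages hand data to one another, not the authors' implementation. Every name in it (`SubPrompts`, `run_pipeline`, and the three `stub_*` functions) is hypothetical, images are represented as raw bytes for simplicity, and the stubs stand in for the LLM and diffusion models, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = bytes  # assumption: an image is just raw bytes in this sketch


@dataclass
class SubPrompts:
    """The three sub-prompts named in the Figure 1 caption."""
    keywords: List[str]
    input_frame_state: str
    optimization_prompt: str


def run_pipeline(
    image: Image,
    text: str,
    enhance_prompt: Callable[[Image, str], SubPrompts],
    generate_key_frame: Callable[[Image, SubPrompts], Image],
    generate_video: Callable[[Image, Image, str], List[Image]],
) -> List[Image]:
    """Chain the three stages: prompt enhancer, key frame generator, video generator."""
    # Stage 1: the image-text input becomes three sub-prompts.
    prompts = enhance_prompt(image, text)
    # Stage 2: the sub-prompts plus the original image yield a key frame.
    key_frame = generate_key_frame(image, prompts)
    # Stage 3: the initial image and key frame, guided by the optimization
    # prompt, yield the final video (a list of frames here).
    return generate_video(image, key_frame, prompts.optimization_prompt)


# Hypothetical stand-ins so the sketch runs end to end without any model.
def stub_enhancer(image: Image, text: str) -> SubPrompts:
    return SubPrompts(
        keywords=text.lower().split(),
        input_frame_state=f"initial frame of {len(image)} bytes",
        optimization_prompt=f"smooth motion: {text}",
    )


def stub_key_frame_generator(image: Image, prompts: SubPrompts) -> Image:
    return image[::-1]  # fake "trailing key frame": the input bytes reversed


def stub_video_generator(first: Image, last: Image, optimization_prompt: str) -> List[Image]:
    # Fake 4-frame video that starts at the first frame and ends at the key frame.
    return [first, first, last, last]


video = run_pipeline(
    b"abc", "A cat jumps",
    stub_enhancer, stub_key_frame_generator, stub_video_generator,
)
```

Passing the stages in as callables keeps the sketch independent of any particular model. In a real system each stub would be replaced by a call to the corresponding model, with interfaces that may differ from the ones assumed here.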