Table of Contents
Fetching ...

Intelligent Director: An Automatic Framework for Dynamic Visual Composition using ChatGPT

Sixiao Zheng, Jingyang Huo, Yu Wang, Yanwei Fu

TL;DR

An Intelligent Director framework is proposed, utilizing LENS to generate descriptions for images and video frames and combining ChatGPT to generate coherent captions while recommending appropriate music names, which aims to automatically integrate various media elements based on user requirements and create storytelling videos.

Abstract

With the rise of short video platforms represented by TikTok, the trend of users expressing their creativity through photos and videos has increased dramatically. However, ordinary users lack the professional skills to produce high-quality videos using professional creation software. To meet the demand for intelligent and user-friendly video creation tools, we propose the Dynamic Visual Composition (DVC) task, an interesting and challenging task that aims to automatically integrate various media elements based on user requirements and create storytelling videos. We propose an Intelligent Director framework, utilizing LENS to generate descriptions for images and video frames and combining ChatGPT to generate coherent captions while recommending appropriate music names. Then, the best-matched music is obtained through music retrieval. Then, materials such as captions, images, videos, and music are integrated to seamlessly synthesize the video. Finally, we apply AnimeGANv2 for style transfer. We construct UCF101-DVC and Personal Album datasets and verified the effectiveness of our framework in solving DVC through qualitative and quantitative comparisons, along with user studies, demonstrating its substantial potential.

Intelligent Director: An Automatic Framework for Dynamic Visual Composition using ChatGPT

TL;DR

An Intelligent Director framework is proposed, utilizing LENS to generate descriptions for images and video frames and combining ChatGPT to generate coherent captions while recommending appropriate music names, which aims to automatically integrate various media elements based on user requirements and create storytelling videos.

Abstract

With the rise of short video platforms represented by TikTok, the trend of users expressing their creativity through photos and videos has increased dramatically. However, ordinary users lack the professional skills to produce high-quality videos using professional creation software. To meet the demand for intelligent and user-friendly video creation tools, we propose the Dynamic Visual Composition (DVC) task, an interesting and challenging task that aims to automatically integrate various media elements based on user requirements and create storytelling videos. We propose an Intelligent Director framework, utilizing LENS to generate descriptions for images and video frames and combining ChatGPT to generate coherent captions while recommending appropriate music names. Then, the best-matched music is obtained through music retrieval. Then, materials such as captions, images, videos, and music are integrated to seamlessly synthesize the video. Finally, we apply AnimeGANv2 for style transfer. We construct UCF101-DVC and Personal Album datasets and verified the effectiveness of our framework in solving DVC through qualitative and quantitative comparisons, along with user studies, demonstrating its substantial potential.
Paper Structure (30 sections, 9 figures, 4 tables)

This paper contains 30 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Dynamic Visual Composition.
  • Figure 2: An overview of our Intelligent Director framework for Dynamic Visual Composition. Intelligent Director consists of four main steps: (1) Caption Generation, (2) Music Retrieval, (3) Video Composition, and (4) Style Transfer. In caption generation, LENS generates descriptions for images and video key frames extracted by pHash, and then ChatGPT creates coherent and storytelling captions and recommends suitable music name. In music retrieval, we utilize the music name recommended by ChatGPT to search a large music library for the best-matched music. In video composition, the seamless integration of captions, images, videos, and music is achieved through four steps: caption fusion, material fine-tuning, switching animation, and music fusion. Finally, in style transfer, AnimeGANv2 transforms the video to other styles, such as the animated style of Kon Satoshi.
  • Figure 3: Results of style transfer with three animated styles on the Personal Album Dataset.
  • Figure 4: Comparison results between our framework and three baseline models on the Personal Album Dataset.
  • Figure 5: Comparison results between our framework and three baseline models on the UCF101-DVC Dataset.
  • ...and 4 more figures