Learning Long-form Video Prior via Generative Pre-Training

Jinheng Xie, Jiajun Feng, Zhaoxu Tian, Kevin Qinghong Lin, Yawen Huang, Xi Xia, Nanxu Gong, Xu Zuo, Jiaqi Yang, Yefeng Zheng, Mike Zheng Shou

TL;DR

The paper investigates learning a long-form video prior by representing movies as sequences of tokenized visual locations and text, which enables generative pre-training with a GPT-like transformer. It introduces Storyboard20K, a richly annotated dataset of storyboards (synopses, bounding boxes, and whole-body keypoints with consistent IDs) that supports token-based modeling instead of modeling in pixel space. The method discretizes coordinates and keypoints into tokens, designs JSON-style prompts, and trains a GPT-2 base decoder to maximize the next-token log-likelihood $L = \sum_i \log p(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$, where $u_i$ is the $i$-th token, $k$ is the context window size, and $\Theta$ denotes the model parameters. Experiments show better textual coherence and layout realism than GPT-3.5 baselines, and demonstrate that the learned prior can guide controllable diffusion processes and assist filmmaking. The paper presents Storyboard20K as a resource for future multi-modal video generation research.
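
The next-token objective above can be illustrated with a minimal sketch. This is not the paper's training code: the function names `log_likelihood` and `uniform_prob` and the toy uniform model are assumptions made for illustration only. It evaluates $L$ for a token sequence given any conditional probability function and a context window of size $k$.

```python
import math

def log_likelihood(tokens, cond_prob, k):
    """Sum of log p(u_i | u_{i-k}, ..., u_{i-1}) over a token sequence.

    cond_prob(token, context) is a hypothetical stand-in for a trained
    model; it must return a probability in (0, 1].
    """
    total = 0.0
    for i, token in enumerate(tokens):
        # Context is at most the k tokens immediately before position i.
        context = tuple(tokens[max(0, i - k):i])
        total += math.log(cond_prob(token, context))
    return total

def uniform_prob(token, context, vocab_size=4):
    """Toy model (assumption): every token is equally likely."""
    return 1.0 / vocab_size

# Three tokens under a uniform model over 4 symbols.
score = log_likelihood([0, 1, 2], uniform_prob, k=2)
```

Training maximizes this quantity with respect to the model parameters; in practice a framework minimizes the equivalent cross-entropy loss.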

Abstract

Concepts involved in long-form videos, such as people, objects, and their interactions, can be viewed as following an implicit prior. They are notably complex and remain challenging to learn comprehensively. In recent years, generative pre-training (GPT) has exhibited versatile capacities in modeling many kinds of text content, including visual locations. Can this approach work for learning a long-form video prior? Instead of operating in pixel space, it is efficient to employ visual locations such as bounding boxes and keypoints to represent key information in videos, which can be simply discretized and then tokenized for consumption by GPT. Due to the scarcity of suitable data, we create a new dataset called Storyboard20K from movies to serve as a representative. It includes synopses, shot-by-shot keyframes, and fine-grained annotations of film sets and characters with consistent IDs, bounding boxes, and whole-body keypoints. In this way, long-form videos can be represented by a set of tokens and learned via generative pre-training. Experimental results validate that our approach has great potential for learning a long-form video prior. Code and data will be released at https://github.com/showlab/Long-form-Video-Prior.
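
As a rough illustration of how visual locations can be "simply discretized and then tokenized", the sketch below quantizes a bounding box into integer bins and serializes it as a JSON-style string. The bin count of 256, the helper names, and the JSON keys are assumptions chosen for this example; the abstract does not specify them.

```python
import json

def quantize(value, size, num_bins=256):
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    ratio = min(max(value / size, 0.0), 1.0)  # clamp to the image extent
    return min(int(ratio * num_bins), num_bins - 1)

def box_to_tokens(box, width, height, num_bins=256):
    """Convert a pixel box (x1, y1, x2, y2) into four discrete coordinate tokens."""
    x1, y1, x2, y2 = box
    return [
        quantize(x1, width, num_bins),
        quantize(y1, height, num_bins),
        quantize(x2, width, num_bins),
        quantize(y2, height, num_bins),
    ]

def dequantize(bin_index, size, num_bins=256):
    """Recover an approximate pixel coordinate from the center of a bin."""
    return (bin_index + 0.5) / num_bins * size

# Hypothetical JSON-style record for one character in one keyframe.
tokens = box_to_tokens((320, 180, 640, 360), width=640, height=360)
prompt = json.dumps({"character_id": 1, "box": tokens})
```

Quantization is lossy: `dequantize` recovers each coordinate only to within half a bin width, which is the cost of turning continuous locations into a finite vocabulary.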

Paper Structure (20 sections, 1 equation, 10 figures, 5 tables)

Figures (10)

  • Figure 1: (a) Samples from the proposed Storyboard20K. It consists of scripts, shot-by-shot keyframes, and fine-grained annotations (bounding boxes and whole body keypoints) of characters and film sets. (b) The proposed approach. Instead of modeling in pixel space, we propose to represent movies as sequences of tokens that can be jointly learned with the script via generative pre-training.
  • Figure 1: Additional storyboard samples.
  • Figure 2: Annotated samples (part of a storyboard) of the proposed Storyboard20K. Our dataset involves three main annotations, i.e., (i) character-centric (whole body keypoints and bounding boxes with consistent IDs), (ii) film-set-centric (bounding boxes), and (iii) summative (texts) annotations. It also includes condensed (as illustrated in Fig. 1) or shot-by-shot descriptions.
  • Figure 2: Indices of sampled keypoints.
  • Figure 3: Summative analysis. (a) and (b) are the statistics of the top 10 genres and emotions. (c) is the visualization of the most frequent film sets (a subset of 300 categories) using a word cloud.
  • ...and 5 more figures