Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Antoine Yang; Arsha Nagrani; Paul Hongsuck Seo; Antoine Miech; Jordi Pont-Tuset; Ivan Laptev; Josef Sivic; Cordelia Schmid

TL;DR

The paper tackles dense video captioning with Vid2Seq, a unified multi-modal sequence-to-sequence model that generates a single token sequence combining event descriptions with their temporal boundaries. It is pretrained on large-scale unlabeled narrated videos by treating transcribed speech sentences as pseudo event captions and their sentence boundaries as pseudo event boundaries, using generative and denoising objectives to learn cross-modal dependencies. The approach improves the state of the art on YouCook2, ViTT, and ActivityNet Captions, generalizes to video paragraph captioning and video clip captioning, and performs well in few-shot settings. The work demonstrates the value of large-scale weak supervision for dense video understanding and provides a foundation for extending to temporally grounded video question answering and action localization.

Abstract

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, which is not available in current annotated datasets. We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries, and using the transcribed speech sentences as pseudo event captions. The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the tasks of video paragraph captioning and video clip captioning, and to few-shot settings. Our code is publicly available at https://antoyang.github.io/vid2seq.html.
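As an illustration of the reformulation described in the abstract, the sketch below turns timestamped transcript sentences into a single target sequence in which each sentence is preceded by a start and an end time token. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Sentence` class, the `<time=k>` token spelling, the function names, and the choice of 100 time bins are all hypothetical and are not given on this page.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """One transcribed speech sentence with its boundaries in seconds (hypothetical type)."""
    start: float
    end: float
    text: str


def time_token(t: float, duration: float, num_bins: int = 100) -> str:
    """Quantize a timestamp into one of `num_bins` relative time tokens.

    The `<time=k>` spelling and the default of 100 bins are assumptions.
    """
    frac = min(max(t / duration, 0.0), 1.0)        # clamp to [0, 1]
    idx = min(int(frac * num_bins), num_bins - 1)  # keep the last bin in range
    return f"<time={idx}>"


def build_sequence(sentences, duration: float, num_bins: int = 100) -> str:
    """Interleave pseudo event boundaries and pseudo captions in one sequence."""
    parts = []
    for s in sorted(sentences, key=lambda s: s.start):
        parts.append(time_token(s.start, duration, num_bins))
        parts.append(time_token(s.end, duration, num_bins))
        parts.append(s.text)
    return " ".join(parts)


transcript = [
    Sentence(5.0, 10.0, "whisk them"),
    Sentence(0.0, 5.0, "crack the eggs"),
]
sequence = build_sequence(transcript, duration=10.0)
```

Because the boundaries are expressed relative to the video duration, the same fixed set of time tokens can be used for videos of any length; the number of bins trades temporal resolution against vocabulary size.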

Paper Structure (31 sections, 1 equation, 6 figures, 16 tables)

Introduction
Related Work
Method
Model
Sequence construction.
Architecture.
Training
Pretraining on untrimmed narrated videos
Downstream task adaptation
Experiments
Experimental setup
Ablation studies
Comparison to the state of the art
Few-shot dense video captioning
Qualitative examples
...and 16 more sections

Figures (6)

Figure 1: Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in the video by generating a single sequence of tokens (right). This ability is enabled by large-scale pretraining on unlabeled narrated videos (left).
Figure 2: Vid2Seq model overview. We formulate dense event captioning as a sequence-to-sequence problem, using special time tokens to allow the model to seamlessly understand and generate sequences of tokens containing both textual semantic information and temporal localization information grounding each text sentence in the video. In detail, all input video frames $x$ and the transcribed speech sequence $y$ are first processed with a Visual Encoder $f$ (a frozen Spatial Encoder $f^s$ followed by a Temporal Encoder $f^t$) and a Text Encoder $g$ (a Token Embedder $g^s$ followed by a Transformer Encoder $g^t$), respectively. Then the Text Decoder $h$ (composed of a Token Embedder $h^s$, a Transformer Decoder $h^t$ and a Language Modeling Head $h^l$) autoregressively generates the output event sequence $z$ by cross-attending to the visual and speech embeddings $x^{t}$ and $y^{t}$.
Figure 3: Pretraining tasks. To train Vid2Seq on unlabeled narrated videos, we design two pretraining objectives. Top: generative objective. Given visual inputs $x$ only, the task is to generate the transcribed speech sequence $y$. Bottom: denoising objective. Given visual inputs $x$ and the corrupted speech sequence $\tilde{y}$, the task is to generate the sequence of recovered speech segments $\bar{y}$.
Figure 4: Example of dense event captioning predictions of Vid2Seq on the ActivityNet Captions validation set, compared with ground-truth.
Figure 5: Examples of dense event captioning predictions of Vid2Seq on the validation set of YouCook2, compared with ground-truth.
...and 1 more figures
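
The denoising objective in the Figure 3 caption can be sketched as span corruption over the speech sequence: selected spans are replaced by sentinel tokens to form the corrupted input $\tilde{y}$, and the target $\bar{y}$ lists each sentinel followed by the tokens it hides. This is a hedged illustration only. The T5-style `<extra_id_k>` sentinel spelling, the function name `corrupt`, and passing the masked spans explicitly (rather than sampling them) are assumptions made here, not details stated on this page.

```python
def corrupt(tokens, spans):
    """Mask the given spans of a token list with sentinel tokens.

    `spans` must be sorted, non-overlapping (start, end) half-open index pairs.
    Returns (corrupted_input, target). The `<extra_id_k>` sentinel naming is an
    assumption borrowed from T5-style span corruption.
    """
    corrupted, target = [], []
    pos = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        corrupted.extend(tokens[pos:start])  # keep the unmasked prefix
        corrupted.append(sentinel)           # one sentinel replaces the span
        target.append(sentinel)              # target names the sentinel ...
        target.extend(tokens[start:end])     # ... then the hidden tokens
        pos = end
    corrupted.extend(tokens[pos:])           # keep the unmasked tail
    return corrupted, target


speech = "<time=0> <time=50> crack the eggs <time=50> <time=99> whisk them".split()
corrupted, target = corrupt(speech, spans=[(2, 4), (6, 7)])
```

In this toy example both text tokens and a time token are masked, so recovering the target requires reasoning over the text and its temporal grounding together. In training the spans would be sampled at random, and the model would also receive the visual inputs $x$.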
