Grounded Video Caption Generation

Evangelos Kazakos, Cordelia Schmid, Josef Sivic

TL;DR

The VideoGround model sets the state of the art for the new task of grounded video caption generation, in which the objects mentioned in the caption are grounded in the video via temporally consistent bounding boxes with coherent natural language labels.

Abstract

We propose a new task, dataset and model for grounded video caption generation. This task unifies captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally consistent bounding boxes. We introduce the following contributions. First, we present a task definition and a manually annotated test dataset for this task, referred to as GROunded Video Caption Generation (GROC). Second, we introduce a large-scale automatic annotation method leveraging an existing model for grounded still image captioning together with an LLM for summarising frame-level captions into temporally consistent captions in video. Furthermore, we prompt the LLM to track by language -- classifying noun phrases from the frame-level captions into noun phrases of the video-level generated caption. We apply this approach to videos from the HowTo100M dataset, which results in a new large-scale training dataset, called HowToGround, with automatically annotated captions and spatio-temporally consistent bounding boxes with coherent natural language labels. Third, we introduce a new grounded video caption generation model, called VideoGround, and train the model on the new automatically annotated HowToGround dataset. Finally, results of our VideoGround model set the state of the art for the new task of grounded video caption generation. We perform extensive ablations and demonstrate the importance of key technical contributions of our model.
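
The "track by language" step described above can be illustrated with a small sketch. It assumes the LLM has already produced a mapping from each frame-level noun phrase to a noun phrase of the video-level caption; the function name `build_tracks`, the data layout, and the toy boxes are hypothetical and are not taken from the paper's implementation.

```python
from collections import defaultdict


def build_tracks(frame_detections, phrase_to_video_phrase):
    """Group frame-level boxes into tracks keyed by video-level noun phrase.

    frame_detections: one list per frame of (frame_phrase, box) pairs, where
        box is an (x1, y1, x2, y2) tuple.
    phrase_to_video_phrase: dict mapping a frame-level phrase to a video-level
        phrase (hypothetical stand-in for the LLM's classification output).
    """
    tracks = defaultdict(dict)
    for frame_idx, detections in enumerate(frame_detections):
        for phrase, box in detections:
            label = phrase_to_video_phrase.get(phrase)
            if label is None:
                # Phrase was not matched to any video-level noun phrase.
                continue
            # Simplifying assumption: keep the first box per label per frame.
            tracks[label].setdefault(frame_idx, box)
    return dict(tracks)


# Toy input: three frames with inconsistent frame-level phrases.
frames = [
    [("a man", (10, 10, 50, 90)), ("a knife", (60, 40, 80, 50))],
    [("the person", (12, 11, 52, 91))],
    [("a chef", (14, 12, 54, 92)), ("the knife", (62, 41, 82, 51)),
     ("a window", (0, 0, 5, 5))],
]
mapping = {
    "a man": "a man", "the person": "a man", "a chef": "a man",
    "a knife": "a knife", "the knife": "a knife",
}
tracks = build_tracks(frames, mapping)
```

Here `tracks["a man"]` holds one box for each of the three frames, `tracks["a knife"]` holds boxes for frames 0 and 2 only, and the unmatched phrase "a window" is dropped. The result is one consistently labelled sequence of boxes per object in the video-level caption.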

Paper Structure

This paper contains 18 sections, 1 equation, 9 figures, 7 tables.

Figures (9)

  • Figure 1: The GROunded video Caption generation task. Three frames from an example video from our new manually annotated GROC dataset of natural language descriptions grounded with temporally consistent bounding boxes in videos.
  • Figure 2: A method for automatic annotation of spatio-temporally grounded captions. In the first stage (left), we apply a still-image grounded caption generation model on individual video frames producing temporally inconsistent outputs. In the second stage (middle), the captions from individual frames are aggregated using an LLM into a single video-level caption describing the most salient actions/objects in the video. Third (right), individual frame-level phrases and bounding boxes are associated over time into a temporally consistent labelling of object bounding boxes over the video.
  • Figure 3: An overview of our VideoGround grounded caption generation model. The key technical innovations enabling grounded caption generation in video are outlined by dashed red rectangles and include: (i) spatio-temporal adapters; (ii) the bounding box decoder and (iii) the temporal objectness head.
  • Figure 4: Qualitative examples showing the predictions of VideoGround on two videos. The examples showcase three important properties of VideoGround: (i) it is able to produce video-level natural language captions describing the main action in the video; (ii) it can ground multiple objects; (iii) it produces spatio-temporally consistent predictions. More qualitative results can be found in the two supplementary qualitative figures in the Appendix. The third and fifth examples in the second of those figures demonstrate another property of VideoGround: (iv) temporal objectness models objects that temporarily leave the frame, as the model does not predict a bounding box for the hand when the hand disappears.
  • Figure 5: Word cloud for (a) HowToGround dataset and (b) GROC test set.
  • ...and 4 more figures
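
The temporal objectness behaviour mentioned in the Figure 3 and Figure 4 captions can be pictured with a minimal sketch. It assumes the model outputs, for one object, a box and an objectness score in [0, 1] per frame, and that a box is suppressed when the score falls below a threshold. The function name `gate_boxes`, the score range, and the threshold of 0.5 are illustrative assumptions, not details confirmed by the source.

```python
def gate_boxes(boxes, objectness, threshold=0.5):
    """Return per-frame boxes, with None where the object is judged absent.

    boxes: list of (x1, y1, x2, y2) tuples, one per frame.
    objectness: list of per-frame scores in [0, 1] (assumed range).
    threshold: hypothetical cut-off below which no box is emitted.
    """
    if len(boxes) != len(objectness):
        raise ValueError("boxes and objectness must have the same length")
    return [
        box if score >= threshold else None
        for box, score in zip(boxes, objectness)
    ]


# Toy example: the object is judged absent in the middle frame.
hand_boxes = [(0, 0, 1, 1), (1, 1, 2, 2), (2, 2, 3, 3)]
hand_scores = [0.9, 0.2, 0.7]
gated = gate_boxes(hand_boxes, hand_scores)
```

In this toy example the second frame yields `None` instead of a box, which mirrors the caption's description of no bounding box being predicted for the hand once it disappears.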