It's Just Another Day: Unique Video Captioning by Discriminative Prompting

Toby Perrett, Tengda Han, Dima Damen, Andrew Zisserman

TL;DR

This paper formulates the problem of unique captioning: given multiple clips with the same caption, generate a new caption for each clip that uniquely identifies it. It proposes Captioning by Discriminative Prompting (CDP), which predicts a property that can separate identically captioned clips and uses that property to generate unique captions.

Abstract

Long videos contain many repeating actions, events and shots. These repetitions are frequently given identical captions, which makes it difficult to retrieve the exact desired clip using a text search. In this paper, we formulate the problem of unique captioning: Given multiple clips with the same caption, we generate a new caption for each clip that uniquely identifies it. We propose Captioning by Discriminative Prompting (CDP), which predicts a property that can separate identically captioned clips, and use it to generate unique captions. We introduce two benchmarks for unique captioning, based on egocentric footage and timeloop movies - where repeating actions are common. We demonstrate that captions generated by CDP improve text-to-video R@1 by 15% for egocentric videos and 10% in timeloop movies.
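
The abstract reports results as text-to-video R@1. As a rough illustration of why identical captions hurt this metric, here is a minimal sketch of recall@1 over a caption-to-clip similarity matrix. The function name `recall_at_1` and the similarity values are hypothetical and not taken from the paper; the sketch assumes that caption `i` belongs to clip `i`.

```python
def recall_at_1(sim):
    """Fraction of captions whose top-scoring clip is the correct one.

    sim[i][j] is an assumed similarity score between caption i and clip j;
    the correct clip for caption i is taken to be clip i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Index of the highest-scoring clip for this caption.
        best = max(range(len(row)), key=row.__getitem__)
        hits += (best == i)
    return hits / len(sim)


# Three clips that share one caption: every caption row scores the clips
# identically, so the same clip is retrieved for all three queries.
identical_captions = [
    [0.9, 0.8, 0.7],
    [0.9, 0.8, 0.7],
    [0.9, 0.8, 0.7],
]

# Illustrative scores after each clip receives a distinct caption.
unique_captions = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.4, 0.7],
]

print(recall_at_1(identical_captions))
print(recall_at_1(unique_captions))
```

With identical captions, only one of the three queries can retrieve its own clip, which is the retrieval failure that unique captioning targets.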

Paper Structure

This paper contains 23 sections, 7 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Standard video captioning breaks the video into smaller clips and considers each clip independently. As a result, it is likely multiple clips from one video will have the same exact caption (a). We introduce Captioning by Discriminative Prompting (CDP), an approach for generating unique captions. CDP considers the set of clips with the same caption (b), and predicts a discriminative prompt (e.g. "holding") that allows the clip to be captioned uniquely (c). When a unique caption cannot be found, we advance to the next clip $\blacktriangleright$ to allow unique captioning based on following actions (d).
  • Figure 2: Pipeline for computing the margin at a single timestep for three clips. This example uses $\alpha=2$, so margins are computed for all single prompts and pairs of prompts. (a) and (b) are replaced by a learned network in Sec. \ref{sec:predprompts}.
  • Figure 3: Training CDPNet, which aims to predict the similarity between a clip (yellow) and the caption from another clip (green), when conditioning the captioner with a prompt. (a) and (b) show how to compute the similarity between the clip and caption in a shared embedding space, which is used as the training signal. (c) shows CDPNet predicting the similarity only using the video clips and prompt in one forward pass.
  • Figure 3: Ablation on $\alpha$, the maximum number of prompts. The LaViLa VCLM baseline is shown for comparison.
  • Figure 4: Examples of the Unique Captioning Benchmarks, from Egocentric videos (left) and timeloop movies (right). We show 3 sequences from each set of clips -- i.e. video clips with the same caption at T=+0. Subsequent clips are indicated by $\blacktriangleright$. We note the common caption in each case.
  • ...and 10 more figures
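
The Figure 2 caption states that with $\alpha=2$, margins are computed for all single prompts and all pairs of prompts. The sketch below only enumerates those candidate prompt sets; it does not compute margins, since the margin definition is not given in this excerpt. The function name `candidate_prompt_sets` and the prompts other than "holding" are hypothetical.

```python
from itertools import combinations


def candidate_prompt_sets(prompts, alpha):
    """Enumerate every set of 1..alpha prompts, in order of increasing size."""
    return [
        combo
        for size in range(1, alpha + 1)
        for combo in combinations(prompts, size)
    ]


# "holding" appears in the Figure 1 caption; the others are placeholders.
prompts = ["holding", "wearing", "location"]
sets_alpha_2 = candidate_prompt_sets(prompts, alpha=2)

for prompt_set in sets_alpha_2:
    print(prompt_set)
```

For three prompts and `alpha=2` this yields three singles and three pairs. The number of candidates grows quickly with `alpha`, which is one reason an upper limit on the number of prompts is a natural parameter to ablate.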