Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

AJ Piergiovanni, Dahun Kim, Michael S. Ryoo, Isaac Noble, Anelia Angelova

TL;DR

This work proposes an efficient, online approach that outputs frequent, detailed, and temporally aligned captions without access to future frames. It shows excellent performance compared to both offline and online methods while using 20% less compute.

Abstract

Automatically generating dense captions that accurately describe the contents of a video remains a challenging area of research. Most current models require processing the entire video at once. Instead, we propose an efficient, online approach that outputs frequent, detailed, and temporally aligned captions without access to future frames. Our model uses a novel autoregressive factorized decoding architecture, which models the sequence of visual features for each time segment, outputs localized descriptions, and efficiently leverages the context from previous video segments. This allows the model to output frequent, detailed captions that describe the video more comprehensively, according to its actual local content, rather than mimicking the training data. We also propose an optimization for efficient training and inference, which enables scaling to longer videos. Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute. The annotations produced are much more comprehensive and frequent, and can further be used in automatic video tagging and in large-scale video data harvesting.
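
The online, factorized decoding described in the abstract can be illustrated with a toy streaming loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `update_context`, `caption_stream`, and `toy_decoder` are hypothetical names, the running average is only a stand-in for the autoregressive model, and the threshold rule is only a stand-in for a learned decoder. What it is meant to show is the data flow: each segment is decoded from its own features plus a summary of earlier segments, and a segment may yield several captions or none.

```python
from typing import Callable, List, Sequence

Feature = List[float]


def update_context(context: Feature, segment: Feature, decay: float = 0.5) -> Feature:
    """Hypothetical stand-in for the autoregressive model: fold the new
    segment into a running summary of everything seen so far."""
    return [decay * c + (1.0 - decay) * s for c, s in zip(context, segment)]


def caption_stream(
    segments: Sequence[Feature],
    decode_segment: Callable[[Feature, Feature], List[str]],
) -> List[List[str]]:
    """Decode segments one at a time, never looking at future segments."""
    context = [0.0] * len(segments[0])  # no history before the first segment
    outputs: List[List[str]] = []
    for segment in segments:
        # The local decoder sees only the prior context and the current segment.
        outputs.append(decode_segment(context, segment))
        # Only afterwards is the current segment folded into the context.
        context = update_context(context, segment)
    return outputs


def toy_decoder(context: Feature, segment: Feature) -> List[str]:
    """Assumed toy rule: emit a caption only when the segment differs
    enough from the history; otherwise emit nothing (redundant content)."""
    change = sum(abs(s - c) for s, c in zip(segment, context))
    return [f"change={change:.2f}"] if change > 0.5 else []


segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
captions = caption_stream(segments, toy_decoder)
```

Because each step depends only on past segments, the captions for a prefix of the video are unchanged when more segments arrive, which is the property that makes streaming output possible.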

Paper Structure

This paper contains 19 sections, 1 equation, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Our online dense video captioning and event localization model produces rich and granular descriptions in a streaming mode, without access to future video content. A key difference in our design is the factorized autoregressive decoding, which effectively leverages prior context to generate localized descriptions that are temporally aligned with the video. This allows the model to produce comprehensive dense captions, avoiding duplications and including the option to produce more than one output at a time, or no outputs, if applicable.
  • Figure 2: The model outputs dense captions for long videos, which are generated much more frequently than the ground truth and are more detailed and specific. The model automatically determines when no output is needed (e.g., when no activities are present or the caption is redundant) and is able to produce outputs in an online fashion, without requiring future video frames, or the entire video.
  • Figure 3: Model architecture overview. The model consists of multiple decoders which are responsible for captioning video segments. The autoregressive transformer models the video temporally, conditioned on the previous feature representation, providing a higher level of abstraction and longer-range temporal modeling. Each local decoder is able to "see" information from features before the current segment, thus understanding the current events in the context of prior ones. This allows more detailed and localized descriptions that correspond to where the activities occur in the video. The required output format is a 'start' of segment token <S>, an 'end' token <E>, and a caption. The text shown at the top of each decoder is provided only during training.
  • Figure 4: Example of standard, global cross-segment masks, and the causal and segment-wise mask we explore for training here.
  • Figure 5: Visualizations of the model outputs. Top: samples from the video sequence. Middle: The ground truth. Bottom: Our model predictions.
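
The mask types named in the Figure 4 caption can be sketched as boolean attention matrices. The caption does not define them precisely, so the definitions here are assumptions based on common usage: a global mask lets every token attend to every token across segments, a causal mask lets a token attend only to itself and earlier positions, and a segment-wise mask lets a token attend only within its own segment. All function names are hypothetical, and `True` means "query row may attend to key column".

```python
from typing import List, Sequence

Mask = List[List[bool]]


def segment_ids(segment_lengths: Sequence[int]) -> List[int]:
    """Map each token position to the index of the segment it belongs to."""
    return [seg for seg, length in enumerate(segment_lengths) for _ in range(length)]


def global_mask(segment_lengths: Sequence[int]) -> Mask:
    """Assumed 'global cross-segment' mask: every token sees every token."""
    n = sum(segment_lengths)
    return [[True] * n for _ in range(n)]


def causal_mask(segment_lengths: Sequence[int]) -> Mask:
    """Assumed causal mask: a token sees itself and earlier positions only."""
    n = sum(segment_lengths)
    return [[key <= query for key in range(n)] for query in range(n)]


def segment_wise_mask(segment_lengths: Sequence[int]) -> Mask:
    """Assumed segment-wise mask: a token sees only its own segment."""
    ids = segment_ids(segment_lengths)
    return [[q == k for k in ids] for q in ids]


# Two segments: the first with two tokens, the second with one.
lengths = [2, 1]
causal = causal_mask(lengths)
local = segment_wise_mask(lengths)
```

Under these assumptions, the causal mask never exposes future tokens, and the segment-wise mask is block-diagonal, so attention cost within the mask grows with segment length rather than with total video length.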