Streaming Dense Video Captioning

Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid

TL;DR

This work tackles dense video captioning under streaming constraints by processing frames causally, one at a time, with a fixed-size memory of tokens. It introduces a clustering-based memory module that summarizes past visual tokens into $K$ centers, keeping computation bounded as the video grows, and a streaming decoding scheme that emits captions at decoding points while reusing earlier predictions as context. The approach yields significant gains over prior non-streaming methods on ActivityNet, YouCook2, and ViTT, and generalizes across backbones such as GIT and Vid2Seq, with additional benefits for paragraph captioning. The resulting framework supports live or long-duration video understanding with richer, temporally localized captions.

Abstract

An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model that consists of two novel components: First, we propose a new memory module, based on clustering incoming tokens, which can handle arbitrarily long videos as the memory is of a fixed size. Second, we develop a streaming decoding algorithm that enables our model to make predictions before the entire video has been processed. Our model achieves this streaming ability, and significantly improves the state-of-the-art on three dense video captioning benchmarks: ActivityNet, YouCook2 and ViTT. Our code is released at https://github.com/google-research/scenic.
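
The memory module described above keeps a fixed number of tokens by clustering incoming tokens. The following is a minimal sketch of one such update, assuming a weighted K-means step in NumPy. The function name `update_memory`, the per-center weights, the choice to initialise the centers from the previous memory, and the `num_iters` default are all illustrative assumptions, not the released Scenic implementation.

```python
import numpy as np


def update_memory(memory, weights, new_tokens, num_iters=2):
    """Cluster the old memory plus incoming tokens back into K centers.

    memory:     (K, D) current memory tokens.
    weights:    (K,)   how many original tokens each memory token summarises.
    new_tokens: (N, D) tokens from the incoming frame.
    num_iters:  number of K-means iterations, assumed to be >= 1.
    Hypothetical helper: a sketch of the idea, not the paper's exact algorithm.
    """
    k = memory.shape[0]
    x = np.concatenate([memory, new_tokens], axis=0).astype(float)  # (K + N, D)
    w = np.concatenate([weights, np.ones(len(new_tokens))])         # (K + N,)
    centers = memory.astype(float)  # assumption: start from the old memory

    for _ in range(num_iters):
        # Squared Euclidean distance from every point to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)          # nearest center per point
        one_hot = np.eye(k)[assign]           # (K + N, K) assignment matrix
        mass = one_hot.T @ w                  # total weight per cluster
        sums = one_hot.T @ (x * w[:, None])   # weighted feature sums
        nonempty = mass > 0                   # empty clusters keep their center
        centers[nonempty] = sums[nonempty] / mass[nonempty][:, None]

    return centers, mass


memory = np.array([[0.0, 0.0], [10.0, 10.0]])
weights = np.array([1.0, 1.0])
new_tokens = np.array([[0.0, 2.0], [10.0, 12.0]])
centers, mass = update_memory(memory, weights, new_tokens)
```

In this sketch the output always has shape `(K, D)` regardless of how many frames have been seen, which is what bounds the cost for arbitrarily long videos. Carrying the weights forward lets a center that already summarises many tokens move less when a single new token joins it.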

Paper Structure (18 sections, 2 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: Comparing our streaming model (b) to conventional global models (a). Conventional global models encode the entire video at once, and produce captions for all events at the end. Our model encodes images frame-by-frame, uses them to update a running memory, and predicts captions sequentially.
  • Figure 2: Illustration of our framework. Each frame is passed through an image encoder, one at a time. A memory model, based on clustering, maintains compressed visual features from the beginning up to the current frame. At certain frames, denoted as "decoding points", we decode the representations from our memory into captions and their timestamps. Earlier text predictions, if available, are also passed as a prefix to the language decoder for the following decoding points. Our model can run on videos of arbitrary length, as the memory has a constant size, and can also output predictions before processing the whole video.
  • Figure 3: Illustration of our clustering-based memory module. The current memory tokens are shown by blue squares. At each time step, the memory tokens evolve by integrating information from the incoming tokens (gray squares), using K-means iterations to produce the updated memory tokens (green circles).
  • Figure 4: Decoding point supervision in training. A decoding point, $d_i$, can be at any frame. At each point, we take the memory features, $\mathbf{M}_{d_i}$, and predict all events that have finished before $d_i$, and are not in the prefix $\mathbf{p}$. Therefore, the union between the prefix and the prediction target covers all events finished before it.
  • Figure 5: Qualitative results on ActivityNet validation. Results from the ground truth (top), the baseline (middle), and our model (bottom). We show outputs from two decoding points in green and blue respectively. Our model captures more details than the baseline.
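
The decoding point supervision in Figure 4 can be illustrated with a small sketch that splits ground-truth events into a prefix and a prediction target. The `Event` type, the function name `split_prefix_and_target`, the use of an earlier decoding point to build the prefix, and the reading of "finished before" as `end <= point` are assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Event(NamedTuple):
    start: float
    end: float
    caption: str


def split_prefix_and_target(
    events: List[Event], prev_point: Optional[float], point: float
) -> Tuple[List[Event], List[Event]]:
    """Split events for one decoding point (hypothetical helper).

    prefix: events that had already finished by the earlier decoding point.
    target: events finished by `point` that are not in the prefix.
    Events still running at `point` appear in neither list.
    """
    finished = sorted((e for e in events if e.end <= point), key=lambda e: e.end)
    if prev_point is None:
        # First decoding point: there is no earlier prediction to reuse.
        return [], finished
    prefix = [e for e in finished if e.end <= prev_point]
    target = [e for e in finished if e.end > prev_point]
    return prefix, target


events = [
    Event(0.0, 5.0, "a person enters the kitchen"),
    Event(3.0, 12.0, "they chop vegetables"),
    Event(10.0, 20.0, "they stir the pot"),
]
prefix, target = split_prefix_and_target(events, prev_point=6.0, point=15.0)
```

Here the first event lands in the prefix, the second becomes the target, and the third is left out because it has not finished by the decoding point. Together, the prefix and the target cover every event finished by that point, matching the property stated in the caption.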