Live Video Captioning

Eduardo Blanco-Fernández, Carlos Gutiérrez-Álvarez, Nadia Nasri, Saturnino Maldonado-Bascón, Roberto J. López-Sastre

TL;DR

This work defines Live Video Captioning (LVC), an online, causal variant of dense video captioning that must generate captions from streaming video with partial observations. It proposes a deformable-transformer–based online model with temporal filtering to predict captions and event boundaries from video segments of length $\Delta t$, using Hungarian matching and a multi-head prediction scheme. To evaluate online performance, the authors introduce the Live Score (LS) online metric and variants (wLS, hLS, whLS) that track caption quality over time using standard scorers (e.g., METEOR, BLEU4, ROUGE-L) while accounting for false positives and temporal history. Experiments on ActivityNet Captions show that LVC achieves superior online performance compared to offline state-of-the-art methods when evaluated with LS, and an evaluation toolkit is made publicly available. The work advances practical live video understanding with implications for accessibility, surveillance, and robotics, and points to future directions in memory-based caption refinement and explainability.

Abstract

Dense video captioning involves detecting and describing events within video sequences. Traditional methods operate in an offline setting, assuming the entire video is available for analysis. In contrast, in this work we introduce a groundbreaking paradigm: Live Video Captioning (LVC), where captions must be generated for video streams in an online manner. This shift brings unique challenges, including processing partial observations of the events and the need for a temporal anticipation of the actions. We formally define the novel problem of LVC and propose innovative evaluation metrics specifically designed for this online scenario, demonstrating their advantages over traditional metrics. To address the novel complexities of LVC, we present a new model that combines deformable transformers with temporal filtering, enabling effective captioning over video streams. Extensive experiments on the ActivityNet Captions dataset validate the proposed approach, showcasing its superior performance in the LVC setting compared to state-of-the-art offline methods. To foster further research, we provide the results of our model and an evaluation toolkit with the new metrics integrated at: https://github.com/gramuah/lvc.

Paper Structure

This paper contains 18 sections, 8 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Top: Traditional dense video captioning models work offline, accessing the whole video to generate the captions. Bottom: Live video captioning models must generate the captions for the video stream online, working with partial observations of the video.
  • Figure 2: LVC adopts a deformable transformer-based architecture to learn the interaction of different frames of the video, including learnable event queries to capture the significance of the relationship between frames and events. Two prediction heads run in parallel on the query features, leveraging mutual benefits between the two tasks and improving their performance together.
  • Figure 3: Example of caption consolidation for a video segment.
  • Figure 4: The LS metric. It allows for online, continuous evaluation of a video stream, analyzed every $\Delta t$ seconds. The metric can integrate any scorer (e.g., METEOR, BLEU4, or ROUGE-L) in the online or live evaluation.
  • Figure 5: Operation of the online metric with a fixed temporal window history. The temporal window moves along the video timeline. The window, of size $w = 5$, covers the scores that are considered when computing the score for the current instant; the first two slots have been discarded. The diagram has been simplified for ease of understanding, but the calculation of scores for each $\Delta t$ is the same as in the previous scenarios.
  • ...and 10 more figures
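
The fixed-window history described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes that the per-segment scores (one per $\Delta t$, from any scorer such as METEOR) are already available and that the window is aggregated with a plain mean; the aggregation actually used by the paper's metric is not given in this summary and may differ. The function name `windowed_live_scores` and the sample scores are hypothetical.

```python
from collections import deque
from typing import Iterable, List


def windowed_live_scores(segment_scores: Iterable[float], w: int = 5) -> List[float]:
    """For each time step, average the last `w` per-segment scores.

    Assumption: a plain mean over the window. This is an illustration only,
    not the paper's definition of the windowed metric.
    """
    if w < 1:
        raise ValueError("window size w must be at least 1")
    window = deque(maxlen=w)  # scores older than w steps drop out automatically
    history = []
    for score in segment_scores:
        window.append(score)
        history.append(sum(window) / len(window))
    return history


# Hypothetical per-segment scores, one per delta-t step.
scores = [0.0, 0.2, 0.4, 0.4, 0.6, 0.8, 1.0]
live = windowed_live_scores(scores, w=5)
```

With seven scores and `w = 5`, the last entry of `live` averages only the final five scores, so the first two slots no longer contribute, which mirrors the situation drawn in Figure 5.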