Live Video Captioning
Eduardo Blanco-Fernández, Carlos Gutiérrez-Álvarez, Nadia Nasri, Saturnino Maldonado-Bascón, Roberto J. López-Sastre
TL;DR
This work defines Live Video Captioning (LVC), an online, causal variant of dense video captioning in which captions must be generated from streaming video using only partial observations. It proposes an online model based on deformable transformers with temporal filtering that predicts captions and event boundaries from video segments of length $\Delta t$, using Hungarian matching and a multi-head prediction scheme. To evaluate online performance, the authors introduce the Live Score (LS) metric and its variants (wLS, hLS, whLS), which track caption quality over time using standard scorers (e.g., METEOR, BLEU4, ROUGE-L) while accounting for false positives and temporal history. Experiments on ActivityNet Captions show that, when evaluated with LS, the proposed model outperforms offline state-of-the-art methods in the online setting, and an evaluation toolkit is made publicly available. The work advances practical live video understanding with implications for accessibility, surveillance, and robotics, and points to future directions in memory-based caption refinement and explainability.
Abstract
Dense video captioning involves detecting and describing events within video sequences. Traditional methods operate in an offline setting, assuming the entire video is available for analysis. In contrast, in this work we introduce a groundbreaking paradigm: Live Video Captioning (LVC), where captions must be generated for video streams in an online manner. This shift brings unique challenges, including the processing of partial observations of events and the need for temporal anticipation of actions. We formally define the novel problem of LVC and propose innovative evaluation metrics specifically designed for this online scenario, demonstrating their advantages over traditional metrics. To address the novel complexities of LVC, we present a new model that combines deformable transformers with temporal filtering, enabling effective captioning over video streams. Extensive experiments on the ActivityNet Captions dataset validate the proposed approach, showcasing its superior performance in the LVC setting compared to state-of-the-art offline methods. To foster further research, we provide the results of our model and an evaluation toolkit with the new metrics integrated at: https://github.com/gramuah/lvc.
