Progress-Aware Video Frame Captioning
Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman
TL;DR
This work introduces progress-aware video frame captioning, a middle ground between image and video captioning that aims to generate temporally fine-grained, frame-specific descriptions reflecting action progression. It proposes ProgressCaptioner, a two-stage framework that first learns frame-pair captioning and then extends to full frame sequences, using supervision auto-generated via progression-detection and caption-matching critics and enabling training on inputs of 2 to $T$ frames. To support training and evaluation, the authors create the FrameCap dataset and the FrameCapEval benchmark, using multiple VLMs to generate pseudo labels and combining automatic and human-in-the-loop assessments. Empirical results show substantial gains over open-source and competitive proprietary models on frame-level progression tasks, with clear benefits for applications such as keyframe selection and video understanding. The work provides a scalable approach to temporally precise video captioning and offers resources to foster further research in this area, while acknowledging limitations related to auto-label noise and the handling of longer sequences.
Abstract
While image captioning provides isolated descriptions for individual images, and video captioning offers a single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level. This novel task aims to generate temporally fine-grained captions that not only accurately describe each frame but also capture the subtle progression of actions throughout a video sequence. Despite their strong capabilities, leading vision-language models often struggle to discern the nuances of frame-wise differences. To address this, we propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. We also develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. The results demonstrate that ProgressCaptioner significantly surpasses leading captioning models, producing precise captions that accurately capture action progression and setting a new standard for temporal precision in video captioning. Finally, we showcase practical applications of our approach, specifically in aiding keyframe selection and advancing video understanding, highlighting its broad utility.
