Progress-Aware Video Frame Captioning

Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman

TL;DR

This work introduces progress-aware video frame captioning, a middle ground between image and video captioning that aims to generate temporally fine-grained, frame-specific descriptions reflecting action progression. It proposes ProgressCaptioner, a two-stage framework that first learns frame-pair captioning and then extends to full frame sequences using auto-generated supervision via progression-detection and caption-matching critics, enabling training with 2-to-$T$ frame inputs. To support training and evaluation, the FrameCap dataset and the FrameCapEval benchmark are created, leveraging multiple VLMs to generate pseudo labels and employing automatic and human-in-the-loop assessments. Empirical results show substantial gains over open-source and competitive proprietary models on frame-level progression tasks, with clear benefits for applications like keyframe selection and enhanced video understanding. The work provides a scalable approach to temporally precise video captioning and offers resources to foster further research in this area, while acknowledging limitations related to auto-label noise and longer sequence handling.

Abstract

While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level. This novel task aims to generate temporally fine-grained captions that not only accurately describe each frame but also capture the subtle progression of actions throughout a video sequence. Despite the strong capabilities of existing leading vision language models, they often struggle to discern the nuances of frame-wise differences. To address this, we propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. Alongside, we develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. The results demonstrate that ProgressCaptioner significantly surpasses leading captioning models, producing precise captions that accurately capture action progression and set a new standard for temporal precision in video captioning. Finally, we showcase practical applications of our approach, specifically in aiding keyframe selection and advancing video understanding, highlighting its broad utility.

Paper Structure

This paper contains 23 sections, 20 figures, 4 tables.

Figures (20)

  • Figure 1: We propose progress-aware video frame captioning (bottom), which aims to generate a sequence of captions that capture the temporal dynamics within a video. Unlike traditional image and video captioning (top) that focus on broad event-level descriptions, our task delves into the detailed, progressive dynamics of an action, necessitating precise, temporally fine-grained capabilities. Blue text highlights how the progress-aware captions build successively on the earlier content to highlight what is changing.
  • Figure 2: Use cases of video frame captioning: finer-grained captions enable detailed, step-by-step guidance for daily tasks.
  • Figure 3: Issues of existing VLMs in video frame captioning: (1) Lack of temporal granularity. See captions for frames 2 and 3, produced by Gemini-1.5-Pro (row 2), which fail to distinguish subtle differences between the frames. (2) Temporal hallucination. See frame 2's caption produced by GPT-4o (row 1), which inaccurately suggests progression that is not visible.
  • Figure 4: Captioning outcomes using Gemini-1.5-Pro [reid2024gemini].
  • Figure 5: Framework of ProgressCaptioner, designed in two stages. In Stage-I, we prepare frame pairs and generate corresponding caption pairs using multiple VLMs. Each caption pair is evaluated by our progression detection and caption matching checks, which decide whether it is selected for supervised fine-tuning or rejected; rejected pairs contribute to the preference data used for preference learning. The Stage-I model is then trained on this collected data. In Stage-II, the trained Stage-I model labels frame sequences with a two-frame sliding window, in conjunction with other VLMs. These sequences are again assessed through progression detection and caption matching to classify them as selected or rejected. All data collected in both stages contributes to the final training of ProgressCaptioner.
  • ...and 15 more figures
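
The data flow described in the Figure 5 caption (a two-frame sliding window over a frame sequence, followed by routing each candidate into a selected or rejected pool) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `sliding_pairs` and `route_candidates` are hypothetical, and the two critic callables are stand-ins for the progression detection and caption matching checks, whose actual logic is not specified in this section.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sliding_pairs(frames: Sequence[T]) -> List[Tuple[T, T]]:
    """Two-frame sliding window: (f1, f2), (f2, f3), ..."""
    return list(zip(frames, frames[1:]))


def route_candidates(
    candidates: Sequence[T],
    passes_progression: Callable[[T], bool],  # hypothetical critic stand-in
    passes_matching: Callable[[T], bool],     # hypothetical critic stand-in
) -> Tuple[List[T], List[T]]:
    """Split candidates into (selected, rejected).

    A candidate is selected only if it passes both checks. Selected items
    would feed supervised fine-tuning; rejected items would contribute to
    preference data.
    """
    selected: List[T] = []
    rejected: List[T] = []
    for cand in candidates:
        if passes_progression(cand) and passes_matching(cand):
            selected.append(cand)
        else:
            rejected.append(cand)
    return selected, rejected
```

For a sequence of $T$ frames, `sliding_pairs` yields $T-1$ overlapping pairs, so every frame except the first is captioned relative to its predecessor. How rejected samples are paired with selected ones to form preference data is not described here, so the sketch leaves that step out.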