VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning

Ji Soo Lee; Jongha Kim; Jeehye Na; Jinyoung Park; Hyunwoo J. Kim

TL;DR

VidChain tackles dense video captioning by decomposing the task into a Chain-of-Tasks (CoTasks) of sequential sub-tasks and by aligning model outputs with evaluation metrics via Metric-based Direct Preference Optimization (M-DPO). CoTasks enables multi-turn reasoning on sub-tasks such as event counting, timing, and captioning, while M-DPO provides fine-grained, metric-driven supervision across both intermediate and final outputs. The approach yields consistent improvements across two VideoLLMs on Dense Video Captioning and Temporal Video Grounding benchmarks, with notable gains in SODA_c, METEOR, CIDEr, recall, and mIoU. The results demonstrate that metric-guided, task-structured supervision can significantly enhance fine-grained video understanding and generalize to related grounding tasks, offering a practical path for improving VideoLLMs without extensive human annotation. Overall, VidChain advances the state of the art in fine-grained video understanding by integrating structured reasoning with metric-aligned optimization.

Abstract

Despite the advancements of Video Large Language Models (VideoLLMs) in various tasks, they struggle with fine-grained temporal understanding, such as Dense Video Captioning (DVC). DVC is a complicated task of describing all events within a video while also temporally localizing them, which integrates multiple fine-grained tasks, including video segmentation, video captioning, and temporal video grounding. Previous VideoLLMs attempt to solve DVC in a single step, failing to utilize their reasoning capability. Moreover, previous training objectives for VideoLLMs do not fully reflect the evaluation metrics and therefore do not provide supervision directly aligned with the target tasks. To address these problems, we propose a novel framework named VidChain, comprising Chain-of-Tasks (CoTasks) and Metric-based Direct Preference Optimization (M-DPO). CoTasks decomposes a complex task into a sequence of sub-tasks, allowing VideoLLMs to leverage their reasoning capabilities more effectively. M-DPO aligns a VideoLLM with evaluation metrics, providing fine-grained supervision for each task that is well aligned with those metrics. Applied to two different VideoLLMs, VidChain consistently improves their fine-grained video understanding, thereby outperforming previous VideoLLMs on two different DVC benchmarks and also on the temporal video grounding task. Code is available at https://github.com/mlvlab/VidChain.
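The idea of aligning a model with an evaluation metric through preference pairs can be illustrated with a small sketch. This is not the paper's M-DPO objective, which is not given in this section: it pairs the standard DPO loss with a hypothetical pair-selection step that ranks sampled candidates by temporal IoU against the ground truth. The function names, the choice of temporal IoU as the metric, and the `min_gap` and `beta` values are all assumptions made for illustration.

```python
import math


def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # For overlapping segments this span equals the union; for disjoint
    # segments the intersection is 0, so the result is 0 either way.
    span = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / span if span > 0 else 0.0


def build_preference_pair(candidates, gt, min_gap=0.1):
    """Hypothetical pair selection: best vs. worst candidate by metric.

    Returns (preferred, dispreferred), or None when the metric gap is
    too small to give a meaningful preference signal.
    """
    ranked = sorted(candidates, key=lambda c: temporal_iou(c, gt))
    worst, best = ranked[0], ranked[-1]
    if temporal_iou(best, gt) - temporal_iou(worst, gt) < min_gap:
        return None
    return best, worst


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin).

    logp_* are sequence log-probabilities under the policy being trained,
    ref_logp_* under the frozen reference model; w = preferred,
    l = dispreferred.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return math.log1p(math.exp(-beta * margin))


if __name__ == "__main__":
    gt_segment = (5.0, 15.0)
    sampled = [(0.0, 10.0), (4.0, 14.0), (20.0, 30.0)]
    print(build_preference_pair(sampled, gt_segment))
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

In this sketch the loss falls below log 2 once the policy favors the metric-preferred response more strongly than the reference model does, which is the sense in which the supervision follows the metric.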

Paper Structure (47 sections, 14 equations, 11 figures, 14 tables)

Introduction
Related Works
Dense Video Captioning.
Video Large Language Models.
Direct Preference Optimization.
Method
Preliminaries
Dense Video Captioning.
Direct Preference Optimization.
Chain-of-Tasks
Objective Decomposition.
Training data construction for CoTasks.
Inference pipeline of CoTasks.
Metric-based Direct Preference Optimization
Training data construction for M-DPO.
...and 32 more sections

Figures (11)

Figure 2: Qualitative example of Dense Video Captioning. Predictions of the baseline VideoLLM (Single-turn), VideoLLM+CoTasks, and VidChain (CoTasks + M-DPO) are illustrated. Red and green highlights denote erroneous and accurate predictions, respectively. The example is from the ActivityNet validation set, using VTimeLLM with the $\mathcal{P}_{c \rightarrow t}$ path.
Figure 3: CoTasks prompt template for DVC.
Figure 4: Margin of the likelihood ratio between preferred and dispreferred responses with $\mathcal{L}_{\text{DPO}}$, $\mathcal{L}_{\text{M-DPO}^{-}}$, and $\mathcal{L}_{\text{M-DPO}}$. The $x$-axis shows training epochs.
Figure 5: Qualitative examples of DVC prediction with VideoLLaMA2 on ActivityNet.
Figure 6: Qualitative examples of DVC prediction with VTimeLLM on ActivityNet.
...and 6 more figures
