VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Haojian Huang, Haodong Chen, Shengqiong Wu, Meng Luo, Jinlan Fu, Xinya Du, Hanwang Zhang, Hao Fei

TL;DR

VistaDPO tackles misalignment and hallucination in Large Video Models by introducing a hierarchical, spatiotemporal Direct Preference Optimization framework. It decomposes alignment into instance-, temporal-, and perceptive-level objectives and provides VistaDPO-7k, a large, richly grounded video-language QA dataset drawn from 14 sources to support fine-grained optimization. The approach yields substantial improvements on video hallucination, QA, and captioning benchmarks, and analyses show enhanced cross-modal representations and robustness to adversarial testing. The work releases code and dataset to advance future research on precise video-language alignment in LVMs.

Abstract

Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and from video hallucination. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on video hallucination, video QA, and captioning benchmarks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
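
To make the dataset description concrete, the following is a minimal sketch of what a single preference record with spatial-temporal grounding might look like. The class name `PreferenceSample`, every field name, the coordinate convention, and the `is_well_formed` check are hypothetical illustrations; the actual VistaDPO-7k schema is defined in the released repository and may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
BBox = Tuple[float, float, float, float]


@dataclass
class PreferenceSample:
    """Illustrative record; field names are assumptions, not the real schema."""
    video_id: str
    question: str
    chosen: str                       # preferred response
    rejected: List[str]               # one or more non-preferred responses
    event_span: Tuple[float, float]   # (start, end) timestamps in seconds
    keyframes: List[int] = field(default_factory=list)  # frame indices
    boxes: List[BBox] = field(default_factory=list)     # object boxes on keyframes

    def is_well_formed(self) -> bool:
        """Basic sanity checks on the grounding annotations."""
        start, end = self.event_span
        span_ok = 0.0 <= start <= end
        boxes_ok = all(x1 < x2 and y1 < y2 for x1, y1, x2, y2 in self.boxes)
        answers_ok = bool(self.rejected) and self.chosen not in self.rejected
        return span_ok and boxes_ok and answers_ok


sample = PreferenceSample(
    video_id="demo_0001",
    question="What does the person do after opening the door?",
    chosen="They walk into the room and sit down.",
    rejected=["They close the door and leave.", "A dog is running on a beach."],
    event_span=(2.0, 5.5),
    keyframes=[48, 96],
    boxes=[(10.0, 20.0, 110.0, 220.0)],
)
```

Keeping the timestamps, keyframes, and boxes next to the chosen and rejected responses is what would let a training loop build preference pairs at the instance, temporal, and perceptive levels from the same record.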

Paper Structure

This paper contains 29 sections, 21 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: (a) Traditional textual DPO overlooks multimodal information, limiting video-language tasks. (b) Existing multimodal DPO methods rely on coarse alignment, missing rich temporal and perceptual details. (c&d) VistaDPO overcomes these limitations with a hierarchical spatiotemporal preference optimization framework, enabling fine-grained video-language alignment and precise reasoning over video dynamics. Here, $y_w$ is the preferred response over $y_l$, and $v_w$ the visual input more likely to produce it than $v_l$.
  • Figure 2: (a) The metadata of VistaDPO-7k highlights its focus on fine-grained video-language tasks, emphasizing temporal ($44\%$) and perceptual ($56\%$) reasoning. $y_l^{ir}$ and $y_l^{re}$ denote the irrelevant and relevant non-preferred responses respectively. (b) VistaDPO introduces a hierarchical spatiotemporal preference optimization framework. Instance ($v^v$) and perceptive ($v^f$) levels align global-to-local semantics with spatial visual features, leveraging both text-relevant and irrelevant rejected responses for robust cross-modal interaction. Temporal ($v^c$) level aligns clip-level semantics with temporal dynamics, enabling precise reasoning across spatial and temporal dimensions.
  • Figure 3: Ablation study of hyperparameters on EventHallusion.
  • Figure 4: t-SNE visualization of representations. (a) Video-LLaVA shows substantial overlap between hallucinated (orange) and non-hallucinated (green) representations. (b) With Hound-DPO, there is no distinct improvement in the separation of the two clusters. (c) With VistaDPO, the representations form clearly separated clusters, highlighting its superior discriminative capability.
  • Figure 5: Ablation study of visual non-preferred samples on two video hallucination benchmarks.
  • ...and 8 more figures
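
The notation in Figure 1 ($y_w$ preferred over $y_l$) follows standard Direct Preference Optimization. As a reference point, here is a minimal sketch of the standard textual DPO loss for one preference pair, written in plain Python. It is not the VistaDPO objective: how the hierarchical terms are defined and combined is not given in this section. The function names and the default `beta` value are assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair.

    logp_w, logp_l: policy log-probabilities of the chosen response y_w
        and the rejected response y_l, given the same input.
    ref_logp_w, ref_logp_l: the same quantities under the frozen
        reference model.
    beta: temperature on the implicit reward (illustrative default).
    """
    # Implicit reward margin: how much more the policy favors y_w over y_l
    # than the reference model does.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


# The loss equals log(2) when policy and reference agree, and shrinks as
# the policy shifts probability mass toward the chosen response.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Under the Figure 1 notation, one could imagine evaluating the same function on log-probabilities of $y_w$ conditioned on $v_w$ versus $v_l$ to contrast visual inputs instead of responses; that reading is an assumption here, not a statement of the paper's formulation.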