ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO

Daechul Ahn; Yura Choi; San Kim; Youngjae Yu; Dongyeop Kang; Jonghyun Choi

TL;DR

Video large multimodal models (VLMMs) face modality misalignment and verbosity issues during iterative preference optimization. ISR-DPO mitigates these challenges by incorporating self-retrospective visual context into the self-judge, grounding preferences in video content. Through nine iterative cycles of DPO, ISR-DPO achieves state-of-the-art results on both in-domain and out-of-domain video QA benchmarks while reducing verbosity hallucinations. The work provides extensive ablations, human alignment analysis, and open-source code/data to foster further research in multimodal alignment.

Abstract

Iterative self-improvement, a concept extending beyond personal growth, has found powerful applications in machine learning, particularly in transforming weak models into strong ones. While recent advances in natural language processing have shown its efficacy through iterative preference optimization, applying this approach to Video Large Multi-modal Models (VLMMs) remains challenging due to modality misalignment: during iterative preference modeling, the self-judge model often prioritizes linguistic knowledge over visual information. Additionally, iterative preference optimization can produce verbose, visually hallucinated responses because of length bias within the self-rewarding cycle. To address these issues, we propose Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO), a method that uses self-retrospection to enhance preference modeling. This approach sharpens the self-judge's focus on informative video regions, resulting in more visually grounded preferences. In extensive empirical evaluations across diverse video question answering benchmarks, ISR-DPO significantly outperforms the state of the art. We are committed to open-sourcing our code, models, and datasets to encourage further investigation.

Paper Structure (32 sections, 5 equations, 13 figures, 5 tables)

Introduction
Related Work
Aligning large multimodal models for videos.
Iterative preference optimization.
Verbosity bias in preference optimization.
Iterative Self-Retrospective DPO
Iterative DPO in VLMM
Initial model.
Preference modeling.
Iterative training.
Self-Retrospective Preference Modeling
Experiments
Experimental Setup
Dataset details.
Training details.
...and 17 more sections

Figures (13)

Figure 1: Illustration of the proposed ISR-DPO. During iterative direct preference optimization (DPO) in a VLMM, we select preferences among responses based not only on the video content but also on a visual context $c_t$, i.e., a detailed video description, to ensure that preferences are grounded in video information. Specifically, we enhance the context in a self-retrospective manner by leveraging the context $c_{t-1}$ generated in the previous iteration, a process we call self-retrospective preference modeling. Red indicates irrelevant responses, while blue indicates accurate, visually grounded responses.
Figure 2: Example of verbosity hallucination within the iterative preference modeling cycle for a VLMM. At the 1st iteration, the response is concise and visually grounded (in blue). By the 9th iteration, the response elaborates further, referencing explicit text overlays in the video. However, it also starts to include irrelevant details and assumptions, leading to the verbosity hallucination highlighted in red.
Figure 3: Overview of self-retrospective Direct Preference Optimization (DPO). Each iteration of ISR-DPO involves three stages: 1) After training iteration $t$, the latest updated VLMM ($\pi_{\theta^{t}}$) generates two different responses $y_1$ and $y_2$ for the given video $V$ and instruction $x$. In addition, a visual description, i.e., visual context, is generated through self-retrospection, providing the necessary input for the next stage, as indicated by the black dotted line. 2) Using the information generated in the previous stage, the model ($\pi_{\theta^{t}}$) compares its responses ($y_1$ and $y_2$) and identifies the preferred response $y_w$ and the rejected response $y_l$. 3) The VLMM ($\pi_{\theta^{t}}$) is then optimized using DPO, updating it to $\pi_{\theta^{t+1}}$.
Figure 4: Length analysis of the preference dataset during iterative DPO. (a) Average (Avg.) word length of the chosen response $|y_{w}|$ in the preference dataset $D_{t}^{\text{pref}}$ across DPO iterations. Self-rewarding results in longer responses than ISR-DPO. (b) Ratio of the word length of chosen responses ($|y_{w}|$) to that of rejected responses ($|y_{l}|$). ISR-DPO consistently maintains a lower ratio than self-rewarding, indicating reduced response length after optimization. '# DPO iteration' denotes the number of DPO iterations.
Figure 5: Average (Avg.) response word length of self-rewarding and ISR-DPO on various video question answering benchmarks. ISR-DPO yields more compact and concise responses than self-rewarding at the same iteration.
...and 8 more figures
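
The three stages described in the Figure 3 caption, together with the length statistics plotted in Figure 4, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the callables `respond`, `describe`, `judge`, and `dpo_update` are hypothetical placeholders for the VLMM's response sampling, self-retrospective context generation, self-judging, and DPO training step, and the ratio in `length_stats` is assumed to be a ratio of total word counts, which the caption does not specify.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple


@dataclass
class PreferencePair:
    video: str
    instruction: str
    chosen: str    # y_w
    rejected: str  # y_l


def isr_dpo(
    policy: Any,
    data: Sequence[Tuple[str, str]],  # (video, instruction) pairs
    num_iterations: int,
    respond: Callable[[Any, str, str], str],                  # hypothetical sampler
    describe: Callable[[Any, str, str], str],                 # hypothetical context generator
    judge: Callable[[Any, str, str, str, str, str], bool],    # hypothetical self-judge
    dpo_update: Callable[[Any, List[PreferencePair]], Any],   # hypothetical DPO step
):
    """Run the iterative loop; returns final policy, per-iteration pairs, contexts."""
    contexts: Dict[str, str] = {}          # c_{t-1} for each video, empty at the start
    history: List[List[PreferencePair]] = []
    for _ in range(num_iterations):
        pairs: List[PreferencePair] = []
        for video, instruction in data:
            # Stage 1: the current policy produces two candidate responses ...
            y1 = respond(policy, video, instruction)
            y2 = respond(policy, video, instruction)
            # ... and a visual context c_t that revisits the previous context c_{t-1}.
            c_t = describe(policy, video, contexts.get(video, ""))
            contexts[video] = c_t
            # Stage 2: the same policy judges the pair, conditioned on video and context.
            first_wins = judge(policy, video, c_t, instruction, y1, y2)
            chosen, rejected = (y1, y2) if first_wins else (y2, y1)
            pairs.append(PreferencePair(video, instruction, chosen, rejected))
        history.append(pairs)
        # Stage 3: one round of DPO on the freshly built preference dataset.
        policy = dpo_update(policy, pairs)
    return policy, history, contexts


def length_stats(pairs: Sequence[PreferencePair]) -> Tuple[float, float]:
    """Average chosen word length and chosen/rejected word-count ratio.

    Assumes `pairs` is non-empty and the rejected responses are not all empty.
    """
    chosen_words = [len(p.chosen.split()) for p in pairs]
    rejected_words = [len(p.rejected.split()) for p in pairs]
    avg_chosen = sum(chosen_words) / len(chosen_words)
    ratio = sum(chosen_words) / sum(rejected_words)
    return avg_chosen, ratio
```

Passing the model-specific steps in as callables keeps the sketch independent of any particular VLMM or training library; calling `length_stats` on each element of the returned history gives per-iteration numbers comparable in spirit to Figure 4.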
