MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation

Jinlan Fu, Shenzhen Huangfu, Hao Fei, Yichong Huang, Xiaoyu Shen, Xipeng Qiu, See-Kiong Ng

TL;DR

This work tackles noisy, inconsistent alt-text annotations and the scarcity of high-quality target data by introducing Multifaceted Cross-modal Direct Preference Optimization (MCM-DPO). MCM-DPO learns preferences across seven cross-modal dimensions (single, pairwise, and multi-preference) spanning alt-text, context, and image, organized into three modules and integrated into a unified objective; it is trained on two large social-media-derived datasets, TAlt and PAlt, along with a 202K SFT pretraining set. Empirical results show that MCM-DPO consistently outperforms supervised fine-tuning and standard DPO on Twitter and Pinterest alt-text tasks, achieving state-of-the-art performance and reducing multimodal hallucinations. The paper also analyzes training paradigms and component contributions, and releases code and datasets to support further research on robust alt-text generation in diverse domains.

Abstract

The alt-text generation task produces concise, context-relevant descriptions of images, enabling blind and low-vision users to access online images. Despite the capabilities of large vision-language models, alt-text generation performance remains limited due to noisy user annotations, inconsistent standards, and the insensitivity of multimodal large language models (MLLMs) to contextual information. Previous efforts to fine-tune MLLMs using supervised fine-tuning (SFT) have struggled, as SFT relies on accurate target annotations, which are often flawed in user-generated alt-text. To address this, we propose Multifaceted Cross-modal Direct Preference Optimization (MCM-DPO), which improves alt-text generation by learning to identify better options in preference pairs without requiring precise annotations. MCM-DPO optimizes preferences across single, paired, and multi-preference dimensions, covering textual, visual, and cross-modal factors. In light of the scarcity of high-quality annotated and preference-labeled datasets for alt-text, we constructed two large-scale, high-quality datasets named TAlt and PAlt, sourced from Twitter and Pinterest. These datasets include 202k annotated alt-text samples and 18k preference pairs that cover diverse preference dimensions, aiming to support further research in this domain. Experimental results show that our proposed MCM-DPO method consistently outperforms both DPO and SFT, establishing a new state of the art in alt-text generation. We release the code and data here: https://github.com/LVUGAI/MCM-DPO

Paper Structure

This paper contains 37 sections, 2 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Comparison between alt-text and an image caption. In the alt-text, "Elsa" comes from the post-text (or context). Findings: Alt-text is concise and context-dependent, whereas a caption provides a detailed description of the image.
  • Figure 2: Results of different models on our PAlt evaluation datasets. SFT: LLaVA after supervised fine-tuning on 202K human-annotated alt-text samples; DPO and MCM-DPO: preference optimization applied to the SFT model using the 8K preference dataset.
  • Figure 3: The framework of MCM-DPO. Symbols $y$, $c$, and $m$ represent alt-text, post-text (context), and image, respectively. The subscript '$_w$' indicates the chosen one (e.g., $c_w$ denotes the chosen post-text), while '$_l$' denotes the rejected one (e.g., $c_l$ denotes the rejected post-text). By default, $y$, $c$, and $m$ refer to $y_w$, $c_w$, and $m_w$.
  • Figure 4: Training paradigms explored in this work. S1 and S2 denote the 'Stage 1: supervised fine-tuning' and 'Stage 2: preference optimization', respectively.
  • Figure 5: Dataset construction process. $c$ and $m$ represent the post-text (context) and image, respectively. $y_w$ and $y_l$ are the chosen and rejected alt-text, respectively.
  • ...and 3 more figures
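
To make the chosen/rejected notation from Figures 3 and 5 concrete, here is a minimal sketch of the standard single-pair DPO loss for a chosen alt-text $y_w$ and a rejected alt-text $y_l$ given the same post-text $c$ and image $m$. It is the generic DPO objective, not the MCM-DPO objective, which combines several preference terms. The function names, the `beta` value, and the use of pre-computed sequence log-probabilities as plain floats are assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_w: float,  # log pi_theta(y_w | c, m), chosen alt-text
    policy_logp_l: float,  # log pi_theta(y_l | c, m), rejected alt-text
    ref_logp_w: float,     # log pi_ref(y_w | c, m), frozen reference (e.g. the SFT model)
    ref_logp_l: float,     # log pi_ref(y_l | c, m)
    beta: float = 0.1,     # assumed temperature; not taken from the paper
) -> float:
    """Standard DPO loss for one preference pair (hypothetical helper)."""
    # Implicit reward margin: how much more the policy has shifted toward
    # the chosen output than toward the rejected one, relative to the reference.
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)
```

When the policy equals the reference, the margin is zero and the loss is $\log 2$; the loss falls below that value as the policy assigns relatively more probability to the chosen alt-text than the reference does.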