Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment

Pengfei Zhao, Rongbo Luan, Wei Zhang, Peng Wu, Sifeng He

Abstract

Despite the remarkable cross-modal retrieval capability of Contrastive Language-Image Pretraining (CLIP), a substantial modality gap persists in its feature space. Intriguingly, we discover that off-the-shelf Multimodal Large Language Models (MLLMs) demonstrate powerful inherent modality alignment properties. While recent MLLM-based retrievers with unified architectures partially mitigate this gap, their reliance on coarse modality alignment mechanisms fundamentally limits their potential. In this work, we introduce MAPLE (Modality-Aligned Preference Learning for Embeddings), a novel framework that leverages the fine-grained alignment priors inherent in MLLMs to guide cross-modal representation learning. MAPLE formulates the learning process as reinforcement learning with two key components: (1) automatic preference data construction using an off-the-shelf MLLM, and (2) a new Relative Preference Alignment (RPA) loss, which adapts Direct Preference Optimization (DPO) to the embedding learning setting. Experimental results show that our preference-guided alignment achieves substantial gains in fine-grained cross-modal retrieval, underscoring its effectiveness in handling nuanced semantic distinctions.

Paper Structure

This paper contains 41 sections, 10 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: (a) Computing the Alignment Score: An off-the-shelf MLLM is prompted to output a “yes” or “no” token for each text-image pair, and the alignment score is calculated from the logits of the “yes” and “no” tokens. (b) Constructing Preference Data: Pairwise or listwise preference data are constructed from the calculated alignment scores. (c) Retrieval Comparison: Through preference alignment, the retrieval model can capture fine-grained distinctions among the images retrieved for the same query.
  • Figure 2: The Training Schema of the Proposed MAPLE. We first prepare the candidate set for each anchor sample through a series of dataset processing operations. In the training stage, we leverage an off-the-shelf MLLM as a reward model to dynamically calculate the alignment scores and subsequently construct the preference data. We extract the embeddings from the policy model (MLLM-based retriever) and align them with the preference data through the RPA loss. This schema primarily illustrates the pairwise training paradigm.
  • Figure 3: The Modality Gap Comparison Between CLIP and Qwen2-VL. The gap is computed on the MMVP dataset. For the CLIP model, we use cosine similarity to construct the similarity distribution. For the Qwen2-VL model, we use the alignment score to construct the similarity distribution. For $W_{\text{dist-gap}}$, lower values are better; for $W_{\text{disc-gap}}$, higher values are better.
  • Figure 4: Example of Prompt and Response for Generating Comparative Descriptions. A prompt for generating comparative descriptions of a pair of images, together with the corresponding JSON response generated by Qwen2.5-VL-72B.
  • Figure 5: Impact of Varying the Hyperparameter $\lambda$ on Retrieval Performance. The "mean" value on the y-axis is the average performance across the text and image retrieval tasks.
  • ...and 4 more figures
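
The Figure 1 caption describes turning the “yes”/“no” token logits into an alignment score and then building pairwise or listwise preference data from those scores. The sketch below illustrates one plausible reading of that pipeline and is not the authors' implementation. It assumes the score is the two-way softmax probability of “yes” (the caption does not give the exact formula), and the function names, the `margin` threshold, and the example logits are all hypothetical.

```python
import math
from itertools import combinations


def alignment_score(yes_logit: float, no_logit: float) -> float:
    """Assumed score: softmax over the two logits, i.e. sigmoid(yes - no)."""
    diff = yes_logit - no_logit
    # Numerically stable sigmoid for either sign of diff.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def pairwise_preferences(scores: dict, margin: float = 0.0) -> list:
    """Return (preferred, rejected) pairs whose score gap exceeds `margin`.

    `margin` is a hypothetical filter that drops near-ties; it is not
    taken from the paper.
    """
    pairs = []
    for a, b in combinations(scores, 2):
        gap = scores[a] - scores[b]
        if abs(gap) > margin:
            pairs.append((a, b) if gap > 0 else (b, a))
    return pairs


def listwise_ranking(scores: dict) -> list:
    """Order candidates from most to least aligned with the query."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical (yes_logit, no_logit) pairs for three candidate images
# scored against one text query.
logits = {"img_a": (4.0, -1.0), "img_b": (0.5, 1.5), "img_c": (3.0, 0.0)}
scores = {name: alignment_score(y, n) for name, (y, n) in logits.items()}
pairs = pairwise_preferences(scores, margin=0.1)
ranking = listwise_ranking(scores)
```

With these made-up logits, `img_a` and `img_c` score close to each other, so the margin filter keeps only the pairs in which each of them is preferred over `img_b`. The listwise form keeps the full ordering instead.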