The Devil Is in the Details: Tackling Unimodal Spurious Correlations for Generalizable Multimodal Reward Models

Zichao Li, Xueru Wen, Jie Lou, Yuqiu Ji, Yaojie Lu, Xianpei Han, Debing Zhang, Le Sun

TL;DR

The paper investigates why Multimodal Reward Models (MM-RMs) fail to generalize to out-of-distribution data due to unimodal spurious correlations, particularly text-only shortcuts. It introduces Shortcut-aware MM-RM learning, which uses a text-only shortcut proxy and a dynamic sample-reweighting scheme governed by the Shortcut-Failure Coefficient (SFC) to emphasize samples requiring true multimodal integration; the trained model is then deployed via a disentangled inference approach. Across three multimodal preference datasets, the method yields substantial cross-distribution gains (e.g., average o.o.d. accuracy rising from $68.1$ to $78.5$) and robust downstream performance, with scalable improvements across 2B, 4B, and 8B models. The work provides a general, modality-agnostic framework for debiasing MM-RMs, improving reliability and safety in multimodal alignment and offering a path to addressing similar spurious correlations in related settings.

Abstract

Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences, particularly as LLMs increasingly interact with multimodal data. However, we find that MM-RMs trained on existing datasets often struggle to generalize to out-of-distribution data due to their reliance on unimodal spurious correlations, primarily text-only shortcuts within the training distribution, which prevents them from leveraging true multimodal reward functions. To address this, we introduce a Shortcut-aware MM-RM learning algorithm that mitigates this issue by dynamically reweighting training samples, shifting the distribution toward better multimodal understanding, and reducing dependence on unimodal spurious correlations. Our experiments demonstrate significant improvements in generalization, downstream task performance, and scalability, establishing a more robust framework for multimodal reward modeling.
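
The reweighting idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm. This overview does not give the Shortcut-Failure Coefficient formula, so `illustrative_weight` below is an assumed stand-in that grows as a text-only proxy's pairwise loss rises relative to the multimodal model's loss. The function names and the dictionary keys `mm_scores` and `text_scores` are hypothetical.

```python
import math


def bt_loss(chosen_score, rejected_score):
    """Bradley-Terry pairwise loss: -log(sigmoid(chosen - rejected))."""
    margin = chosen_score - rejected_score
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def illustrative_weight(text_only_loss, multimodal_loss, eps=1e-12):
    """ASSUMED weight in (0, 1): larger when the text-only proxy fails.

    This is a placeholder for the paper's Shortcut-Failure Coefficient,
    whose exact definition is not stated in this overview.
    """
    return text_only_loss / (text_only_loss + multimodal_loss + eps)


def reweighted_loss(batch):
    """Mean of per-sample multimodal losses, scaled by the assumed weight."""
    total = 0.0
    for sample in batch:
        l_mm = bt_loss(*sample["mm_scores"])     # (chosen, rejected) from the MM-RM
        l_txt = bt_loss(*sample["text_scores"])  # (chosen, rejected) from the text-only proxy
        weight = illustrative_weight(l_txt, l_mm)  # treated as a constant, not differentiated
        total += weight * l_mm
    return total / len(batch)
```

Under this assumed weighting, a pair that the text-only proxy ranks incorrectly contributes more to the loss than a pair it already ranks correctly, which matches the stated goal of emphasizing samples that require multimodal understanding.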

Paper Structure

This paper contains 36 sections, 9 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: Cross-distribution evaluations of three kinds of RM, where the diagonal elements represent i.i.d. tests, while the off-diagonal elements represent o.o.d. tests. (a) Standard MM-RM has significant room for improvement in certain o.o.d. test scenarios. (b) Text-only shortcuts achieve high accuracy under i.i.d. tests but demonstrate poor generalization in o.o.d. scenarios. (c) Our algorithm demonstrates substantial improvements in generalization, with average accuracy across six o.o.d. scenarios increasing from 68.1 to 78.5.
  • Figure 2: Accuracy and Shortcut-Failure Degradation of MM-RMs in various o.o.d. scenarios. $\mathcal{S}^e \rightarrow \mathcal{S}^{e'}$ indicates that the MM-RM trained on $\mathcal{S}_{train}^e$ is tested on $\mathcal{S}_{test}^{e'}$. MM-RMs show consistently poor performance when text-only shortcuts become ineffective.
  • Figure 3: The correlation between MM-RM reward scores and text-only RM scores, using two o.o.d. test scenarios as examples.
  • Figure 4: An empirically observed text-only shortcut in the POVID data. POVID generates rejected responses by injecting hallucinations into standard answers, which often introduces spurious correlations between query-irrelevant descriptive elements and bad responses.
  • Figure 5: Text-only RMs achieve high accuracy on both the full test set and a length-balanced subset under i.i.d. scenarios. Beyond length bias, fine-grained unimodal spurious correlations remain that models can learn.
  • ...and 6 more figures
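
The cross-distribution layout described for Figure 1 (diagonal entries are i.i.d. tests, off-diagonal entries are o.o.d. tests) can be summarized with a short sketch. The function name is hypothetical and the accuracy values are placeholders chosen for illustration, not results from the paper.

```python
def summarize_cross_distribution(acc):
    """Return (mean i.i.d. accuracy, mean o.o.d. accuracy).

    acc[i][j] is the accuracy of a model trained on dataset i and
    tested on dataset j, so i == j is an i.i.d. test.
    """
    n = len(acc)
    iid = [acc[i][i] for i in range(n)]
    ood = [acc[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(iid) / len(iid), sum(ood) / len(ood)


# Placeholder 3x3 matrix (three datasets give six o.o.d. cells).
# These numbers are made up and do NOT come from the paper.
example_acc = [
    [90.0, 60.0, 70.0],
    [65.0, 85.0, 75.0],
    [55.0, 80.0, 95.0],
]

iid_avg, ood_avg = summarize_cross_distribution(example_acc)
print(f"i.i.d. mean: {iid_avg:.1f}, o.o.d. mean: {ood_avg:.1f}")
```

With three datasets, the off-diagonal average runs over six train/test pairs, which is the quantity the Figure 1 caption reports as rising from 68.1 to 78.5.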

Theorems & Definitions (2)

  • Definition 3.1
  • Definition 4.1