The Devil Is in the Details: Tackling Unimodal Spurious Correlations for Generalizable Multimodal Reward Models
Zichao Li, Xueru Wen, Jie Lou, Yuqiu Ji, Yaojie Lu, Xianpei Han, Debing Zhang, Le Sun
TL;DR
The paper shows that Multimodal Reward Models (MM-RMs) generalize poorly to out-of-distribution (o.o.d.) data because they rely on unimodal spurious correlations, particularly text-only shortcuts. It introduces Shortcut-aware MM-RM learning, which uses a text-only shortcut proxy and a dynamic sample-reweighting scheme governed by the Shortcut-Failure Coefficient (SFC) to emphasize samples that require true multimodal integration; the trained model is then deployed via a disentangled inference approach. Across three multimodal preference datasets, the method yields substantial cross-distribution gains (e.g., average o.o.d. accuracy rising from $68.1$ to $78.5$) and robust downstream performance, with improvements that hold across 2B, 4B, and 8B models. The work provides a general, modality-agnostic framework for debiasing MM-RMs, improving reliability and safety in multimodal alignment and offering a path to addressing similar spurious correlations in related settings.
Abstract
Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences, particularly as LLMs increasingly interact with multimodal data. However, we find that MM-RMs trained on existing datasets often struggle to generalize to out-of-distribution data due to their reliance on unimodal spurious correlations, primarily text-only shortcuts within the training distribution, which prevents them from leveraging true multimodal reward functions. To address this, we introduce a Shortcut-aware MM-RM learning algorithm that mitigates this issue by dynamically reweighting training samples, shifting the distribution toward better multimodal understanding, and reducing dependence on unimodal spurious correlations. Our experiments demonstrate significant improvements in generalization, downstream task performance, and scalability, establishing a more robust framework for multimodal reward modeling.
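To make the reweighting idea concrete, the following is a minimal sketch, not the authors' implementation. This section does not give the formula for the Shortcut-Failure Coefficient, so the code assumes one plausible form: the text-only proxy's share of the combined pairwise loss. The function names (`bt_loss`, `shortcut_failure_coefficient`, `reweighted_loss`), the Bradley-Terry pairwise loss, and the toy reward values are all illustrative assumptions.

```python
import math


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss, -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def shortcut_failure_coefficient(proxy_loss: float, mm_loss: float,
                                 eps: float = 1e-8) -> float:
    """ASSUMED form: the text-only proxy's share of the combined loss.

    Close to 1 when the text-only proxy fails on a pair, close to 0 when
    the text-only shortcut already ranks the pair correctly.
    """
    return proxy_loss / (proxy_loss + mm_loss + eps)


def reweighted_loss(mm_pairs, proxy_pairs) -> float:
    """Mean multimodal loss, each pair weighted by its (assumed) SFC.

    mm_pairs and proxy_pairs hold (reward_chosen, reward_rejected) tuples
    for the same samples, scored by the multimodal model and by the
    text-only proxy. Weights are plain floats here; treating them as
    constants during backpropagation is a further assumption.
    """
    total = 0.0
    for (mc, mr), (pc, pr) in zip(mm_pairs, proxy_pairs):
        mm_loss = bt_loss(mc, mr)
        weight = shortcut_failure_coefficient(bt_loss(pc, pr), mm_loss)
        total += weight * mm_loss
    return total / len(mm_pairs)


# Toy batch (made-up rewards): the proxy ranks sample 0 correctly from text
# alone and ranks sample 1 incorrectly.
mm_pairs = [(1.0, 0.0), (1.0, 0.0)]
proxy_pairs = [(3.0, 0.0), (0.0, 3.0)]
weights = [
    shortcut_failure_coefficient(bt_loss(pc, pr), bt_loss(mc, mr))
    for (mc, mr), (pc, pr) in zip(mm_pairs, proxy_pairs)
]
print(weights, reweighted_loss(mm_pairs, proxy_pairs))
```

Under this assumed form, the pair the text-only proxy gets wrong receives the larger weight, so the training signal shifts toward samples that cannot be solved from text alone.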
