COPO: Causal-Oriented Policy Optimization for Hallucinations of MLLMs
Peizheng Guo, Jingyao Wang, Wenwen Qiang, Jiahuan Zhou, Changwen Zheng, Gang Hua
TL;DR
The paper investigates hallucinations in Multimodal LLMs and argues that outcome-based rewards in GRPO can induce spurious, background-dependent reasoning. It then introduces COPO, a causal-oriented policy optimization framework that computes a causal completeness reward by jointly assessing token-level sufficiency and necessity, and integrates that reward into the GRPO advantage to emphasize causally grounded tokens. In experiments on CHAIR, POPE, and other benchmarks across multiple open-source MLLMs, COPO reduces hallucinations, improves grounding, and produces better qualitative outputs, supporting its effectiveness and plug-in nature. The work points to a principled way of aligning multimodal generation with causal evidence, potentially improving reliability in vision-language systems.
Abstract
Although Multimodal Large Language Models (MLLMs) have shown impressive capabilities, they may suffer from hallucinations. Empirically, we find that MLLMs attend disproportionately to task-irrelevant background regions compared with text-only LLMs, implying spurious background-answer correlations. We claim and analyze that (i) outcome-based rewards can be an important factor leading to spurious correlations, and (ii) spurious correlations can be an important factor leading to hallucinations. Based on these results, we propose Causal-Oriented Policy Optimization (COPO) to mitigate these spurious correlations and thereby address hallucinations. COPO imposes token-level sufficiency and necessity constraints to measure each inference token's causal contribution, encouraging correct and evidence-grounded output. Specifically, we first evaluate each token's causal contribution via a newly proposed causal completeness reward. This reward is then used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are causally sufficient and necessary for accurate generation. Experimental results across various benchmarks demonstrate the advantages of COPO.
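
To make the idea of a causally informed advantage concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes that per-token sufficiency and necessity scores in [0, 1] are already available, combines them by a simple product into a hypothetical causal completeness score, and adds that score (scaled by an assumed weight `alpha`) to the standard group-normalized GRPO advantage. The function names, the product rule, the additive combination, and `alpha` are all illustrative assumptions; the abstract does not specify them.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each response's outcome reward
    against the other responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def causal_completeness(sufficiency, necessity):
    """Hypothetical per-token score (an assumption, not the paper's formula):
    the product is high only when a token is both sufficient and necessary."""
    return [s * n for s, n in zip(sufficiency, necessity)]


def causally_informed_advantages(rewards, sufficiency, necessity, alpha=0.5):
    """Broadcast each response-level advantage to its tokens and add a
    weighted causal term, so causally grounded tokens get a larger signal.

    rewards:      one outcome reward per sampled response
    sufficiency:  per-response lists of token-level sufficiency scores
    necessity:    per-response lists of token-level necessity scores
    alpha:        assumed weight of the causal term (illustrative only)
    """
    base = group_relative_advantages(rewards)
    token_advantages = []
    for adv, suff, nec in zip(base, sufficiency, necessity):
        scores = causal_completeness(suff, nec)
        token_advantages.append([adv + alpha * c for c in scores])
    return token_advantages


if __name__ == "__main__":
    # Two sampled responses with two tokens each (toy numbers).
    rewards = [1.0, 0.0]
    sufficiency = [[0.9, 0.1], [0.5, 0.5]]
    necessity = [[0.8, 0.2], [0.5, 0.5]]
    for row in causally_informed_advantages(rewards, sufficiency, necessity):
        print([round(a, 3) for a in row])
```

In this toy setup, the first token of the first response scores high on both sufficiency and necessity, so it receives a larger advantage than the second token of the same response, even though both share the same outcome reward. Plain GRPO would assign them identical advantages.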
