COPO: Causal-Oriented Policy Optimization for Hallucinations of MLLMs

Peizheng Guo, Jingyao Wang, Wenwen Qiang, Jiahuan Zhou, Changwen Zheng, Gang Hua

TL;DR

The paper investigates hallucinations in Multimodal LLMs and shows that outcome-based rewards in GRPO can induce spurious background-dependent reasoning. It then introduces COPO, a causal-oriented policy optimization framework that computes a causal completeness reward by jointly assessing token-level sufficiency and necessity, and integrates it into the GRPO advantage to emphasize causally grounded tokens. Through extensive experiments on CHAIR, POPE, and other benchmarks across multiple open-source MLLMs, COPO demonstrates reduced hallucinations, improved grounding, and better qualitative outputs, validating its effectiveness and plug-in nature. The work highlights a principled path to align multimodal generation with causal evidence, potentially improving reliability in vision-language systems.

Abstract

Although Multimodal Large Language Models (MLLMs) have shown impressive capabilities, they may suffer from hallucinations. Empirically, we find that MLLMs attend disproportionately to task-irrelevant background regions compared with text-only LLMs, implying spurious background-answer correlations. We claim and analyze that (i) outcome-based rewards can be an important factor leading to spurious correlations, and (ii) spurious correlations can be an important factor leading to hallucinations. Based on these results, we propose Causal-Oriented Policy Optimization (COPO) to mitigate these spurious correlations and thereby reduce hallucinations. COPO imposes token-level sufficiency and necessity constraints to measure each inference token's causal contribution, encouraging correct and evidence-grounded output. Specifically, we first evaluate each token's causal contribution via a newly proposed causal completeness reward. This reward is then used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are causally sufficient and necessary for accurate generation. Experimental results across various benchmarks demonstrate the advantages of COPO.
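
The abstract does not give the formulas, so the following is only a rough sketch of how a token-level causal score might be folded into a GRPO-style advantage. The group-relative standardization of rewards is the standard GRPO step; everything else is an assumption made for illustration: the function names `causal_completeness` and `token_advantages`, the product of sufficiency and necessity as the completeness score, the additive combination, and the weight `alpha` are all hypothetical and should not be read as the paper's actual definitions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standard GRPO step: standardize each response's outcome reward
    against the mean and standard deviation of its sampled group.
    (Population std is used here; implementations differ on this.)"""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def causal_completeness(sufficiency, necessity):
    """ASSUMPTION: score a token as the product of its sufficiency and
    necessity scores, so it is high only when both are high."""
    return [s * n for s, n in zip(sufficiency, necessity)]


def token_advantages(seq_advantage, completeness, alpha=0.5):
    """ASSUMPTION: shift the sequence-level advantage per token by the
    centered completeness score, scaled by a hypothetical weight alpha."""
    center = mean(completeness)
    return [seq_advantage + alpha * (c - center) for c in completeness]


# Four sampled responses with binary outcome rewards.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])

# Two tokens of the first response with made-up sufficiency/necessity scores.
scores = causal_completeness([0.9, 0.2], [0.8, 0.9])
per_token = token_advantages(advantages[0], scores)
```

In this toy setup the first token, which scores high on both sufficiency and necessity, ends up with a larger advantage than the second, which is the qualitative behavior the abstract describes.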

Paper Structure

This paper contains 49 sections, 11 equations, 12 figures, 10 tables, 2 algorithms.

Figures (12)

  • Figure 1: Example of hallucination in MLLMs: the model invents a specific date although the event date in the poster is actually obscured.
  • Figure 2: Motivating results. Both MLLM and LLM are trained via GRPO.
  • Figure 3: SCMs for MLLMs. Solid circles denote observable variables and dashed circles unobservable ones; black arrows denote true correlations and dashed arrows spurious correlations.
  • Figure 4: Overview of our causal-oriented policy optimization framework. The upper part is the pipeline of COPO, and the lower part is the calculation process of our proposed causal completeness rewards.
  • Figure 5: Results on text quality evaluation.
  • ...and 7 more figures

Theorems & Definitions (1)

  • Definition 3.1: PNS for MLLMs