PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, Shiguo Lian

TL;DR

PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs and achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse.

Abstract

Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations: cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.

Paper Structure (23 sections, 4 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Comparison of reasoning behaviors among baseline models, baseline models (+GRPO), and PaLMR on a visual reasoning sample. PaLMR demonstrates perception-aware reasoning and produces faithful answers through process-level perception alignment, addressing the hallucinated reasoning seen in prior models.
  • Figure 2: Overview of the proposed PaLMR framework. The model adopts a two-layer architecture: (a) the Perception-Aligned Data Layer (PaDLayer) builds process-aware multimodal data with structured pseudo ground truths and verifiable visual facts; and (b) the Process-Aligned Optimization Layer (PaOLayer) integrates perception-aware, answer, and format rewards into GRPO to enforce visually faithful and logically coherent reasoning.
  • Figure 3: Model-human alignment ratio in identifying visual perception errors. The evaluation covers 100 randomly sampled responses generated by Qwen2.5-VL-7B on the Geo3K dataset, with Qwen2.5-32B and Qwen3-30B as judge models.
  • Figure 4: Data distribution after PaDLayer data filtering. 19 distinct sub-domains are selected, and the 4728 remaining samples are used to generate the training dataset.
  • Figure 5: Samples selected from different domains, providing a qualitative comparison between baseline models and our PaLMR.
  • ...and 8 more figures
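
The Figure 2 caption says the PaOLayer integrates perception-aware, answer, and format rewards into GRPO. The abstract and captions do not give the fusion rule, so the following is only a minimal sketch under stated assumptions: the gating on format, the weights, and the names `fuse_rewards` and `group_advantages` are hypothetical and are not taken from the paper. The only part that follows standard practice is the group-relative step, where GRPO standardises each reward within its group of sampled responses.

```python
from statistics import mean, pstdev


def fuse_rewards(format_ok: bool, answer_correct: bool, perception_score: float,
                 w_format: float = 0.1, w_answer: float = 0.6,
                 w_perception: float = 0.3) -> float:
    """Hypothetical hierarchical fusion of the three reward signals.

    Assumption: a malformed response earns nothing, so the format check
    gates the answer and perception terms. Weights are illustrative only.
    """
    if not format_ok:
        return 0.0
    # Clamp the perception score to [0, 1] before weighting it.
    perception = min(max(perception_score, 0.0), 1.0)
    return w_format + w_answer * float(answer_correct) + w_perception * perception


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: standardise rewards within one group.

    Assumption: population standard deviation; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt: (format_ok, answer_correct, perception_score)
    group = [
        (True, True, 1.0),    # right answer, faithful perception
        (True, True, 0.2),    # right answer, misperceived evidence
        (True, False, 0.0),   # wrong answer, wrong perception
        (False, True, 1.0),   # malformed output
    ]
    rewards = [fuse_rewards(*sample) for sample in group]
    for sample, adv in zip(group, group_advantages(rewards)):
        print(sample, round(adv, 3))
```

Under these assumed weights, a response that reaches the right answer while misperceiving the image scores below one that is both correct and visually faithful, which is the behaviour an outcome-only reward cannot express.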