Knowing the Answer Isn't Enough: Fixing Reasoning Path Failures in LVLMs

Chaoyang Wang, Yangfan He, Yiyang Zhou, Yixuan Wang, Jiaqi Liu, Peng Xia, Zhengzhong Tu, Mohit Bansal, Huaxiu Yao

TL;DR

The paper addresses LVLMs' tendency to follow flawed, unstable reasoning paths despite possessing the requisite knowledge, evidenced by large Pass@K versus Pass@1 gaps. It introduces Path-Select Optimization (PSO), a two-stage post-training method that first uses GRPO to initialize structured reasoning traces and then conducts online, on-policy path optimization with a thinking reward and a Negative Replay Memory to suppress brittle trajectories. By jointly optimizing process-level reasoning quality and final answers through online preference signals and memory retrieval, PSO continually refines reasoning paths toward stability and correctness. Experiments across MathVista, MathVerse, MMMU, and other benchmarks show that PSO yields substantial gains in both reasoning accuracy and stability, advancing reasoning-level alignment for multimodal systems.

Abstract

We reveal a critical yet underexplored flaw in Large Vision-Language Models (LVLMs): even when these models know the correct answer, they frequently arrive there through incorrect reasoning paths. The core issue is not a lack of knowledge, but a path selection bias within the vast reasoning search space. Although LVLMs are often capable of sampling correct solution trajectories, they disproportionately favor unstable or logically inconsistent ones, leading to erratic and unreliable outcomes. The substantial disparity between Pass@K (with large K) and Pass@1 across numerous models provides compelling evidence that such failures primarily stem from misreasoning rather than ignorance. To systematically investigate and address this issue, we propose PSO (Path-Select Optimization), a two-stage post-training framework designed to enhance both the reasoning performance and stability of existing LVLMs. In the first stage, we employ Group Relative Policy Optimization (GRPO) with template and answer-based rewards to cultivate structured, step-by-step reasoning. In the second stage, we conduct online preference optimization, where the model samples reasoning paths from GRPO-generated data, self-evaluates them, and aligns itself toward the preferred trajectories. Incorrect or suboptimal paths are concurrently stored in a Negative Replay Memory (NRM) as hard negatives, which are periodically revisited to prevent the model from repeating prior mistakes and to facilitate continual reasoning refinement. Extensive experiments show that PSO effectively prunes invalid reasoning paths, substantially enhances reasoning accuracy (with 7.4% improvements on average), and yields more stable and consistent chains of thought. Our code will be available at https://github.com/aiming-lab/PSO.
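The Pass@K versus Pass@1 gap cited in the abstract can be made concrete with a small sketch. The abstract does not say how Pass@K is computed, so the code below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n sampled answers of which c are correct. The function names `pass_at_k` and `pass_gap` are illustrative and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k);
    the paper may compute Pass@K differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_gap(n: int, c: int, k: int) -> float:
    """Gap between pass@k and pass@1 for the same set of samples."""
    return pass_at_k(n, c, k) - pass_at_k(n, c, 1)
```

For a question where 3 of 10 sampled paths are correct, pass@1 is 0.3 while pass@5 is about 0.917. A gap of this kind is what the abstract reads as a sign of misreasoning rather than missing knowledge: a correct path is reachable, but it is not the one the model selects most often.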

Paper Structure

This paper contains 20 sections, 4 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: LVLMs can output coherent yet flawed reasoning, but for the same question may produce correct chains, revealing that these errors arise from unstable reasoning rather than inability.
  • Figure 2: Overview of PSO. Stage 1 (Answer Reward-Guided Reasoning Initialization): starting from a base LVLM, GRPO with accuracy and format rewards on multimodal inputs initializes structured step-by-step reasoning. Stage 2 (Online Path Optimization): for each query, the model samples multiple think–answer trajectories, a thinking reward model ranks them, and low-quality paths are stored in Negative Replay Memory as hard negatives for online, on-policy preference optimization. This closed loop prunes brittle paths and shifts probability toward stable, reliable reasoning trajectories.
  • Figure 3: Comparison of Pass@k performance on MMBench and MMBench-Star for Qwen2.5VL-7B-Instruct before and after PSO.
  • Figure 4: Qwen2.5vl-Instruct-7B thinking rewards on the MMMU dataset. LS = Logical Soundness, EI = Error Identification, CR = Correct Reasoning, LC = Language Consistency, RD = Redundancy.
  • Figure 5: Reasoning reward distribution on the MMMU dataset with different methods. Orange denotes the rewards of reasoning paths sampled from Qwen2.5vl-Instruct-7B, while green denotes those from Qwen2.5vl-Instruct-7B + PSO.
  • ...and 7 more figures
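
The Stage 2 loop described in the Figure 2 caption (sample several trajectories, rank them by a thinking reward, keep low-quality paths as hard negatives) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trajectory` fields, the bounded FIFO memory, the choice to store only the lowest-ranked path, and the pairing rule in `build_preference_pairs` are all hypothetical, and the reward is taken as an already computed number.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    query: str
    text: str      # a sampled think-answer trace
    reward: float  # thinking-reward score (hypothetical scale)


class NegativeReplayMemory:
    """Bounded FIFO store of low-reward trajectories (hypothetical design)."""

    def __init__(self, capacity: int = 1000) -> None:
        self._buf = deque(maxlen=capacity)

    def add(self, traj: Trajectory) -> None:
        self._buf.append(traj)

    def sample(self, query: str, k: int, rng: random.Random) -> list:
        # Retrieve up to k stored negatives for the same query.
        pool = [t for t in self._buf if t.query == query]
        return rng.sample(pool, min(k, len(pool)))

    def __len__(self) -> int:
        return len(self._buf)


def build_preference_pairs(trajs, memory, replay_k=1, rng=None):
    """Rank one query's trajectories and return (chosen, rejected) pairs.

    The best trajectory is paired with the worst fresh sample and with
    older negatives replayed from memory (an assumed pairing rule).
    """
    if len(trajs) < 2:
        return []
    rng = rng or random.Random(0)
    ranked = sorted(trajs, key=lambda t: t.reward, reverse=True)
    chosen, worst = ranked[0], ranked[-1]
    if chosen.reward == worst.reward:
        return []  # all rewards tie, so there is no preference signal
    # Replay before storing, so replayed negatives come from earlier rounds.
    replayed = [t for t in memory.sample(chosen.query, replay_k, rng)
                if t.reward < chosen.reward]
    memory.add(worst)
    return [(chosen, neg) for neg in [worst, *replayed]]
```

The resulting pairs would feed whatever preference objective is used for the online update; that objective is outside the scope of this sketch.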