Puzzle Curriculum GRPO for Vision-Centric Reasoning

Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao, Javad Rajabi, Ran Zhang, Raghav Goyal, Babak Taati, Radek Grzeszczuk

TL;DR

The paper tackles the challenge of improving visual reasoning in vision-language models without relying on costly supervision by introducing Puzzle Curriculum GRPO (PC-GRPO). It combines self-supervised puzzle rewards (PatchFit, Rotation, Jigsaw), a difficulty-aware curriculum, and a Reasoning–Answer Consistency (RAC) monitor to guide post-training and stabilize learning. PC-GRPO demonstrates robust gains across diverse vision-centric benchmarks on Qwen backbones, while also revealing pervasive benchmark noise and offering auditing/remediation strategies. The work provides a practical, scalable path for verifiable RL post-training in VLMs and emphasizes the importance of consistency signals alongside task rewards for downstream performance.

Abstract

Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
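To make the "flat rewards and vanishing group-relative advantages" point concrete, the following is a minimal sketch and not the authors' implementation. It uses the standard GRPO convention of standardizing rewards within a rollout group. The Jigsaw partial-credit rule (fraction of correctly placed tiles) and the weight `4 * p * (1 - p)` are assumptions chosen only because they match the abstract's description of graded credit and of a weight that peaks at medium difficulty. All function names are hypothetical.

```python
from statistics import mean, pstdev


def jigsaw_partial_credit(predicted, target):
    """Assumed graded reward: fraction of tiles placed in the correct slot."""
    if len(predicted) != len(target) or not target:
        return 0.0
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward within its group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # If every rollout gets the same reward, all advantages are zero
    # and the group contributes no learning signal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def difficulty_weight(rewards):
    """Assumed curriculum weight: 1.0 at mean reward 0.5, 0.0 at 0 or 1."""
    p = mean(rewards)
    return 4.0 * p * (1.0 - p)


if __name__ == "__main__":
    # A puzzle the policy always solves: flat rewards, no signal, zero weight.
    easy = [1.0, 1.0, 1.0, 1.0]
    print(group_relative_advantages(easy), difficulty_weight(easy))

    # A medium-difficulty puzzle: mixed rewards, non-zero advantages, full weight.
    medium = [1.0, 0.0, 1.0, 0.0]
    print(group_relative_advantages(medium), difficulty_weight(medium))

    # Graded Jigsaw credit: two of four tiles are in the right place.
    print(jigsaw_partial_credit([0, 1, 3, 2], [0, 1, 2, 3]))
```

Under these assumptions, a group with identical binary rewards yields all-zero advantages, which is why graded credit and a weight that favors medium-difficulty samples both help keep the update informative.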

Paper Structure

This paper contains 32 sections, 9 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: Performance of our model against state-of-the-art methods on diverse visual reasoning benchmarks. The chart compares the PC-GRPO model (Ours) with strong baselines, including the Qwen-2.5-VL-7B base model. Each axis represents a different benchmark. Our method achieves competitive or superior results across the board, demonstrating that the supervision-free puzzle curriculum effectively enhances the model's visual reasoning capabilities. Additionally, the reasoning abilities of PC-GRPO reveal critical levels of noise in popular vision benchmarks. We audit and clean some of these benchmarks (denoted with the _clean suffix) using high-performance VLMs. We then benchmark PC-GRPO and existing baselines on the clean subsets.
  • Figure 2: PC-GRPO overcomes fundamental reasoning failures in VLMs. When asked a simple visual reasoning question, existing GRPO-tuned models often fail by overthinking irrelevant details, shortcutting to a statistically likely but incorrect answer, or producing a final answer that contradicts their own reasoning trace. PC-GRPO learns to produce a faithful and visually grounded answer.
  • Figure 3: An overview of our GRPO post-training framework. The process starts with input puzzles which are dynamically weighted by difficulty using a curriculum learning approach. The agent iteratively generates solutions over multiple rounds. These solutions are evaluated using GRPO rewards, which in turn are used for policy evolution. We track reasoning-answer consistency during post-training and show that PC-GRPO boosts RAC and downstream performance.
  • Figure 4: Examples of three major types of annotation noise in vision-centric benchmarks. User studies show that $10\% \sim 20\%$ of the samples in these benchmarks are noisy. Nevertheless, our proposed method learns to produce faithful and visually grounded answers. The left image, taken from MME by Fu et al., is licensed for academic use (https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation). The middle image, taken from MMStar by Chen et al., is licensed under CC BY 4.0 (https://github.com/MMStar-Benchmark/MMStar). The right image, taken from MMBench by Liu et al., is licensed under Apache 2.0 (https://github.com/open-compass/MMBench).
  • Figure 5: Tracking GRPO metrics during post-training across four puzzle environments. All charts report a moving average with a window size of 100 over training steps. (a) Variance among the rollout rewards. (b) Consistency rate between rollout reasoning and final answer, measured by the Qwen2.5-VL-72B model. (c) Average number of tokens decoded per trajectory. (d) Reward score, which is the partially graded Jigsaw solution reward.
  • ...and 13 more figures
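
The Figure 5 caption describes per-step metrics smoothed with a moving average of window size 100. A minimal sketch of that kind of logging is shown below. It assumes a trailing window that is shorter during the first steps, and population variance over the rewards of one rollout group. Neither detail is stated in the caption, and the function names are hypothetical.

```python
from collections import deque
from statistics import pvariance


def rollout_reward_variance(rewards):
    """Population variance of the rewards in one rollout group (assumed definition)."""
    return pvariance(rewards)


def moving_average(values, window=100):
    """Trailing moving average; early steps average over the values seen so far."""
    buf = deque(maxlen=window)
    total = 0.0
    smoothed = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # the oldest value is about to be evicted
        buf.append(v)
        total += v
        smoothed.append(total / len(buf))
    return smoothed


if __name__ == "__main__":
    # One reward-variance value per training step, then smoothed for plotting.
    per_step_groups = [[1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 0.0, 1.0]]
    variances = [rollout_reward_variance(g) for g in per_step_groups]
    print(moving_average(variances, window=100))
```

A variance of zero for a step means every rollout in that group received the same reward, which is the flat-reward case the curriculum is meant to down-weight.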