RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, Huan Wang

TL;DR

This work tackles sparse rewards in fine-grained visual reasoning over structured maps by introducing ReasonMap-Plus, a densely supervised extension of ReasonMap, and RewardMap, a two-part framework combining a difficulty-aware reward design with a multi-stage curriculum based on Group Relative Policy Optimization. RewardMap formulates a reward R that combines format, correctness, and detail with map- and question-level weighting to guide learning from simple perception to complex reasoning, enabling effective cold-start RL. Empirical results show RewardMap yields consistent gains on ReasonMap, ReasonMap-Plus, and six additional benchmarks, with an average improvement of 3.47% across diverse tasks, indicating improved visual understanding and topological reasoning. The approach demonstrates strong generalization beyond transit maps and provides a principled pathway to addressing long-horizon visual reasoning in multimodal models.
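As a rough illustration of how a reward combining format, correctness, and detail terms might be scaled by map- and question-level difficulty, consider the sketch below. The exact functional form is not given in this summary, so the additive combination, the `alpha` coefficient on the detail term, the linear `difficulty_weight`, and all names here are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class RewardParts:
    """Per-response reward components (hypothetical container)."""
    format_ok: bool      # response follows the required output format
    correct: bool        # final answer matches the reference
    detail_score: float  # partial credit in [0, 1] for intermediate details


def difficulty_weight(map_level: int, question_level: int) -> float:
    """Assumed weighting: harder maps and harder questions scale the reward up."""
    return 1.0 + 0.25 * map_level + 0.25 * question_level


def total_reward(parts: RewardParts, map_level: int, question_level: int,
                 alpha: float = 0.5) -> float:
    """Assumed form: R = W_difficulty * (R_format + R_correct + alpha * R_detail)."""
    base = (float(parts.format_ok)
            + float(parts.correct)
            + alpha * parts.detail_score)
    return difficulty_weight(map_level, question_level) * base


# A wrong answer that still gets half of the details right earns more than
# a wrong answer with no correct details, so the signal is less sparse.
partial = total_reward(RewardParts(True, False, 0.5), map_level=0, question_level=0)
empty = total_reward(RewardParts(True, False, 0.0), map_level=0, question_level=0)
```

Under these assumptions, the detail term is what separates two incorrect responses from each other, which is the role the text assigns to detail rewards.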

Abstract

Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
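The multi-stage scheme described above, which moves from simple perception to complex reasoning, can be pictured as a data scheduler that releases training samples stage by stage. The sketch below is a minimal illustration under assumed conventions: the integer `stage` field, its ordering (lower means simpler), and the function name are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterator, List


def curriculum_stages(samples: List[Dict]) -> Iterator[List[Dict]]:
    """Yield samples grouped by stage, simplest stage first.

    Each sample is assumed to carry an integer "stage" key, where a lower
    value means a simpler (more perception-oriented) task.
    """
    for stage in sorted({s["stage"] for s in samples}):
        yield [s for s in samples if s["stage"] == stage]


# Hypothetical mixed pool: stage 0 = perception-style questions,
# stage 2 = full reasoning tasks.
pool = [
    {"id": "q3", "stage": 2},
    {"id": "q1", "stage": 0},
    {"id": "q2", "stage": 1},
    {"id": "q4", "stage": 0},
]
schedule = [[s["id"] for s in batch] for batch in curriculum_stages(pool)]
```

In an RL loop, each yielded batch would be used for one training stage before moving on to the next, so the policy meets dense, easy rewards before the sparse, hard ones.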

Paper Structure

This paper contains 24 sections, 4 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of ReasonMap-Plus. ReasonMap-Plus comprises $4{,}018$ questions from $5$ extended question types and maps from $30$ cities across $13$ countries.
  • Figure 2: Overview of RewardMap. The framework enhances fine-grained visual understanding and reasoning in MLLMs through reinforcement learning with Group Relative Policy Optimization (GRPO). It consists of two key components: (1) a difficulty-aware reward design (described in the section on difficulty-aware reward design), which combines format, correctness, and detail rewards with difficulty-based weighting; and (2) a multi-stage RL curriculum (described in the section on multi-stage RL), which schedules training data from simple perception tasks to complex reasoning tasks so that optimization remains effective despite sparse rewards.
  • Figure 3: Qualitative comparisons among reference models, baseline, and our proposed RewardMap. We crop and zoom in on the transit map for clearer presentation.
  • Figure 4: Comparison of training rewards between baseline RL and RewardMap. The yellow curve denotes the reward trajectory of RewardMap, while the blue curve corresponds to the baseline RL trained solely on ReasonMap.
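
The captions above refer to GRPO and to sparse rewards. The following sketch shows the group-relative advantage normalization commonly used in GRPO and why a sparse reward stalls it. It is a generic illustration, not the authors' implementation; the function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Sparse case: every sampled response fails, so all advantages are zero
# and the group contributes no learning signal.
sparse = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Denser case: partial credit separates the responses, so the advantages
# are non-zero even though no response is fully correct.
dense = group_relative_advantages([0.0, 0.25, 0.5, 0.25])
```

This is one way to see why adding graded reward terms matters under GRPO: without variation inside a group, the normalized advantages vanish.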