DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning

Qi Cao, Ruiyi Wang, Ruiyi Zhang, Sai Ashish Somayajula, Pengtao Xie

TL;DR

DreamPRM introduces a domain-reweighted training framework for multimodal process reward models, addressing dataset quality imbalance through bi-level optimization that jointly tunes PRM parameters and domain weights. By training PRMs with Monte Carlo signals across multiple domains and evaluating on a meta-domain with an aggregation-function loss, the method improves generalization and inference reliability for multimodal reasoning. Empirical results across five benchmarks show consistent gains over vanilla PRMs and competitive MLLMs, including top performance on MathVista with a smaller backbone. The work demonstrates robust scaling, transferability to stronger models, and interpretable domain weights that correlate with data quality, offering a practical path to more reliable multimodal reasoning systems.

Abstract

Reasoning has substantially improved the performance of large language models (LLMs) on complicated tasks. Central to current reasoning research, Process Reward Models (PRMs) offer a fine-grained evaluation of intermediate reasoning steps and guide the reasoning process. However, extending PRMs to multimodal large language models (MLLMs) introduces challenges. Because multimodal reasoning covers a wider range of tasks than text-only scenarios, the distribution shift between training and testing sets is more severe, which makes generalization harder. Training a reliable multimodal PRM therefore demands large and diverse datasets to ensure sufficient coverage. However, current multimodal reasoning datasets suffer from a marked quality imbalance, which degrades PRM performance and highlights the need for an effective data selection strategy. To address these issues, we introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs that employs bi-level optimization. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights, allowing the PRM to prioritize high-quality reasoning signals and alleviating the impact of dataset quality imbalance. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset; this feedback updates the domain weights through an aggregation loss function, thereby improving the generalization capability of the trained PRM. Extensive experiments on multiple multimodal reasoning benchmarks covering both mathematical and general reasoning show that test-time scaling with DreamPRM consistently improves the performance of state-of-the-art MLLMs. Further comparisons reveal that DreamPRM's domain-reweighting strategy surpasses other data selection methods and yields higher accuracy gains than existing test-time scaling approaches.
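
The two-level update described in the abstract can be sketched on a toy scalar problem. This is an illustration under stated assumptions, not the paper's implementation: the function name, the quadratic per-domain losses, the one-step unrolled hypergradient, the clipping of weights at zero, and the plain squared-error meta loss (standing in for the aggregation loss function, which this section does not specify) are all hypothetical choices made for the example.

```python
def bilevel_domain_reweighting(domain_targets, meta_target, steps=200,
                               lr_theta=0.1, lr_alpha=0.5):
    """Toy bi-level loop (illustrative assumptions throughout).

    Lower level: theta takes a gradient step on sum_k alpha_k * L_k(theta),
                 with L_k(theta) = 0.5 * (theta - t_k)^2 for domain target t_k.
    Upper level: alpha takes a gradient step on the meta loss
                 0.5 * (theta_next - meta_target)^2, differentiated through
                 the single lower-level step (one-step unrolling).
    """
    theta = 0.0                           # stand-in for the PRM parameters
    alpha = [1.0] * len(domain_targets)   # one weight per training domain

    for _ in range(steps):
        # Per-domain gradients dL_k/dtheta at the current theta.
        grads = [theta - t for t in domain_targets]

        # Lower-level step on the domain-weighted training loss.
        theta_next = theta - lr_theta * sum(a * g for a, g in zip(alpha, grads))

        # Upper-level step: d(meta loss)/d(alpha_k) via the chain rule,
        # using d(theta_next)/d(alpha_k) = -lr_theta * grads[k].
        meta_residual = theta_next - meta_target
        hypergrads = [meta_residual * (-lr_theta * g) for g in grads]

        # Keep weights non-negative (an assumption of this sketch).
        alpha = [max(0.0, a - lr_alpha * h) for a, h in zip(alpha, hypergrads)]
        theta = theta_next

    return theta, alpha


# Two domains agree with the meta domain; the third pulls theta away from it.
theta, alpha = bilevel_domain_reweighting([1.0, 1.0, -3.0], meta_target=1.0)
print(theta, alpha)
```

In this toy setting the weight of the third domain, whose gradient moves the parameter away from the meta target, shrinks while the other two grow, which mirrors the intended effect of down-weighting lower-quality domains.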

Paper Structure

This paper contains 33 sections, 10 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: DreamPRM improves multimodal reasoning by mitigating the dataset quality imbalance problem. Left: On five benchmarks, DreamPRM outperforms base model (InternVL-2.5-8B-MPO [wang2024mpo]) by an average of $+4.0\%$. DreamPRM also consistently surpasses Vanilla PRM trained without data selection. Right: Easy AI2D [kembhavi2016diagramworthdozenimages] questions (weight 0.55) vs. hard M3CoT [chen2024m3cotnovelbenchmarkmultidomain] questions (weight 1.49) shows how DreamPRM prioritizes data that demand deeper reasoning - samples requiring knowledge from both textual and visual modalities for step-by-step logical deduction.
  • Figure 2: General flow of training PRM and using PRM for inference. Training phase: Train PRM with Monte Carlo signals from intermediate steps of Chain-of-Thoughts (CoTs). Inference phase: Use the trained PRM to verify CoTs step by step and select the best CoT. Conventional training of PRM has poor generalization capability due to distribution shift between training set and testing set.
  • Figure 3: The proposed bi-level optimization based domain-reweighting method. Lower-level optimization: In this stage, PRM's parameters are updated on multiple datasets with domain weights, allowing the PRM to prioritize domains with better quality. Upper-level optimization: In this stage, the PRM is evaluated on a separate meta dataset to compute an aggregation function loss and optimize the domain weights. DreamPRM helps address dataset quality imbalance problems and leads to stronger and more generalizable reasoning performance.
  • Figure 4: Leaderboard on MathVista (as of October 15, 2025). The first column ("o4-mini + DreamPRM") reports our own evaluation, while the remaining results are taken from the official MathVista leaderboard. The compared models include VL-Rethinker [Wang2025VLRethinker], Step R1-V-Mini [StepFun2025StepR1VMini], Kimi-k1.6-preview [Kimi2025k16preview], Kimi-k1.5 [Kimi2025k15], Doubao-pro-1.5 [DoubaoProduct], Ovis2-34B [Ovis2_34B_2025], OpenAI o1 [openai2024openaio1card], Llama 4 Maverick [Llama4Maverick2025BlogLlama4MaverickHF], and Vision-R1-7B [Huang2025VisionR1].
  • Figure 5: Comparative evaluation of DreamPRM on multimodal reasoning benchmarks. Radar charts report accuracy (%) on five datasets (WeMath, MathVista, MathVision, MMVet, and MMStar). (a) Impact of different data selection strategies. (b) Comparison with existing test-time scaling methods. (c) Ablation study of three key components, i.e. w/o aggregation function loss (AFL), w/o bi-level optimization (BLO), and w/o structural thinking (ST).
  • ...and 5 more figures
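
The inference phase in the Figure 2 caption (verify each chain of thought step by step, then select the best one) can be sketched as a best-of-N selection. This is a minimal sketch, not the paper's procedure: `select_best_cot`, the `toy_scores` lookup that stands in for a trained PRM, and the choice of the mean as the step-score aggregator are all hypothetical, since the captions do not say how step scores are combined.

```python
from statistics import mean


def select_best_cot(candidates, step_scorer, aggregate=mean):
    """Return (best_cot, best_score) among candidate chains of thought.

    candidates:  list of CoTs, each a list of step strings.
    step_scorer: callable mapping one step to a score in [0, 1];
                 a stand-in for a trained PRM (hypothetical here).
    aggregate:   how per-step scores become one CoT score (assumption: mean;
                 min is another common choice).
    """
    best_cot, best_score = None, float("-inf")
    for cot in candidates:
        score = aggregate([step_scorer(step) for step in cot])
        if score > best_score:            # first candidate wins on ties
            best_cot, best_score = cot, score
    return best_cot, best_score


# Hypothetical per-step scores used in place of a real PRM.
toy_scores = {
    "read chart": 0.9,
    "guess answer": 0.2,
    "compute ratio": 0.8,
    "state answer": 0.85,
}

candidates = [
    ["read chart", "guess answer"],
    ["read chart", "compute ratio", "state answer"],
]

best, score = select_best_cot(candidates, toy_scores.get)
print(best, score)
```

Passing `aggregate=min` instead scores each chain by its weakest step, which penalizes a single unsupported step more heavily than the mean does.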