Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

TL;DR

This work introduces Athena-PRM, a data-efficient process reward model for multimodal reasoning that assigns rewards to individual reasoning steps. It keeps only the step labels on which a weak and a strong completer agree, which yields high-quality process labels from just 5,000 samples and reduces labeling cost relative to Monte Carlo methods. The authors introduce two training enhancements, ORM initialization and negative data up-sampling, and validate the approach in three scenarios: verification for test-time scaling, direct evaluation of step correctness, and reward-ranked fine-tuning. Athena-PRM achieves state-of-the-art results on VisualProcessBench and gains of 10.2 points on WeMath and 7.1 points on MathVista with Qwen2.5-VL-7B as the policy model. Reward-ranked fine-tuning with Athena-PRM also produces Athena-7B, a multimodal model that outperforms its baseline on five benchmarks.

Abstract

We present Athena-PRM, a multimodal process reward model (PRM) designed to assign a reward score to each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily because step-level annotations of the reasoning process are required. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose using prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM is effective across various scenarios and benchmarks with just 5,000 samples. We also develop two strategies to improve the performance of PRMs: outcome reward model (ORM) initialization and up-sampling of negative data. We validate our approach in three scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward-ranked fine-tuning. Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, with Qwen2.5-VL-7B as the policy model, Athena-PRM improves performance by 10.2 points on WeMath and 7.1 points on MathVista under test-time scaling. Furthermore, Athena-PRM sets a new state of the art (SoTA) on VisualProcessBench, outperforming the previous SoTA by 3.9 F1 points, which demonstrates its ability to accurately assess the correctness of individual reasoning steps. Additionally, using Athena-PRM as the reward model, we develop Athena-7B with reward-ranked fine-tuning; it outperforms the baseline by a significant margin on five benchmarks.
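The consistency criterion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Completer` signature, the hard labeling rule (a step counts as correct if at least one rollout from it reaches the reference answer), the rollout count, and the toy completers are all assumptions made for illustration. The only idea taken from the source is that a step label is kept when the weak and the strong completer agree on it.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: a completer takes the question and a prefix of
# solution steps, samples one continuation, and returns its final answer.
Completer = Callable[[str, Sequence[str]], str]


def mc_step_label(completer: Completer, question: str, prefix: Sequence[str],
                  gold_answer: str, num_rollouts: int = 8) -> int:
    """Hard Monte Carlo label (assumed rule): 1 if any rollout from this
    prefix reaches the reference answer, otherwise 0."""
    hits = sum(completer(question, prefix) == gold_answer
               for _ in range(num_rollouts))
    return int(hits > 0)


def consistent_step_labels(weak: Completer, strong: Completer, question: str,
                           steps: Sequence[str], gold_answer: str,
                           num_rollouts: int = 8) -> List[Optional[int]]:
    """Label every step with both completers and keep a label only when
    they agree; disagreeing steps get None and would be filtered out."""
    labels: List[Optional[int]] = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        weak_label = mc_step_label(weak, question, prefix, gold_answer, num_rollouts)
        strong_label = mc_step_label(strong, question, prefix, gold_answer, num_rollouts)
        labels.append(weak_label if weak_label == strong_label else None)
    return labels


# Toy deterministic completers: the strong one recovers from a flawed step,
# the weak one does not, and neither recovers from an unrecoverable step.
def weak(question: str, prefix: Sequence[str]) -> str:
    broken = any("wrong" in s or "unrecoverable" in s for s in prefix)
    return "?" if broken else "4"


def strong(question: str, prefix: Sequence[str]) -> str:
    return "?" if any("unrecoverable" in s for s in prefix) else "4"


steps = ["2 + 2 = 4", "wrong detour", "unrecoverable error"]
print(consistent_step_labels(weak, strong, "What is 2 + 2?", steps, "4"))
# prints [1, None, 0]
```

In this toy run the second step is dropped because the two completers disagree, which mirrors the situation in Figure 1 where the strong completer still reaches the final answer from a wrong intermediate step while the weak completer fails.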

Paper Structure

This paper contains 22 sections, 4 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Illustration of different completers given the same question and solution steps. Even when given wrong intermediate steps, the strong completer still reaches the final answer while the weak completer fails. Some intermediate steps are omitted from the figure for simplicity.
  • Figure 2: Best-of-N results on MathVista [lu2024mathvista] across different policies. The number of sampled solutions ranges from 4 to 64.
  • Figure 3: Results of data scaling from 5K to 60K. The number of sampled solutions per question is set to 8.
  • Figure 4: A case study from VisualProcessBench [wang2025visualprm].
  • Figure 5: A case study from VisualProcessBench [wang2025visualprm].
  • ...and 2 more figures
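
The Best-of-N setting reported in Figure 2 can be sketched as follows: sample N candidate solutions from the policy, score each one with the PRM, and return the highest-scoring candidate. This is a minimal sketch under stated assumptions. The `StepScorer` interface and the toy scorer are hypothetical, and reducing per-step scores to a solution score with `min` is an illustrative choice, since the text here does not say how Athena-PRM aggregates step rewards.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a PRM maps (question, steps) to one score per step.
StepScorer = Callable[[str, Sequence[str]], List[float]]


def best_of_n(question: str, candidates: Sequence[Sequence[str]],
              score_steps: StepScorer,
              aggregate: Callable[[List[float]], float] = min) -> int:
    """Return the index of the candidate with the highest aggregated
    step score. The default aggregation (min) is an assumption."""
    def solution_score(steps: Sequence[str]) -> float:
        scores = score_steps(question, steps)
        # An empty solution has no steps to score and should never win.
        return aggregate(scores) if scores else float("-inf")

    return max(range(len(candidates)),
               key=lambda i: solution_score(candidates[i]))


# Toy scorer backed by a lookup table, standing in for a trained PRM.
TOY_SCORES: Dict[str, float] = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.8, "e": 0.6}


def toy_scorer(question: str, steps: Sequence[str]) -> List[float]:
    return [TOY_SCORES[s] for s in steps]


candidates = [["a", "b"], ["c", "d"], ["e"]]
print(best_of_n("q", candidates, toy_scorer))                 # prints 1
print(best_of_n("q", candidates, toy_scorer, aggregate=max))  # prints 0
```

With `min`, a single low-scoring step disqualifies an otherwise strong solution, so the second candidate wins; with `max`, the first candidate wins despite its weak second step. The choice of aggregation therefore changes which solution is selected.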