Reasoning-Aware GRPO using Process Mining

Taekhyun Park, Yongjae Lee, Hyerim Bae

TL;DR

PM4GRPO addresses the limitation that GRPO-based post-training focuses solely on final answers. It treats reasoning as a process and uses process mining to reward alignment with a pretrained teacher. The method defines a conformance reward: Inductive Miner infers a reasoning process from the policy's output, and alignment-based conformance checking compares it with the teacher's process. This reward is combined with the traditional format and answer rewards, so the total reward for response $y_i$ to query $x$ is $R(x,y_i) = R_i^f + R_i^a + R_i^c$, where $R_i^c = \frac{2 \cdot \mathrm{fitness}_i \cdot \mathrm{precision}_i}{\mathrm{fitness}_i + \mathrm{precision}_i}$ is the harmonic mean of fitness and precision. Empirical results on 1.5B and 7B backbones across five math benchmarks show PM4GRPO achieving superior performance, particularly on MATH500 and OlympiadBench for larger models, demonstrating the effectiveness of process-mining-driven reasoning signals for RL post-training in large reasoning models.
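The reward composition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the fitness and precision values are assumed to come from an upstream conformance-checking step and to lie in [0, 1], and returning 0 when both are 0 is an assumption made here to avoid division by zero.

```python
def conformance_reward(fitness: float, precision: float) -> float:
    """Harmonic mean of fitness and precision (the R_i^c term).

    Assumption: both inputs lie in [0, 1] and are produced by a
    separate conformance-checking step that is not shown here.
    """
    denom = fitness + precision
    if denom == 0.0:
        # Assumption: define the reward as 0 when both scores are 0,
        # since the formula itself is undefined in that case.
        return 0.0
    return 2.0 * fitness * precision / denom


def total_reward(format_reward: float, answer_reward: float,
                 fitness: float, precision: float) -> float:
    """R(x, y_i) = R_i^f + R_i^a + R_i^c for a single response y_i."""
    return format_reward + answer_reward + conformance_reward(fitness, precision)


if __name__ == "__main__":
    # Hypothetical values for one sampled response.
    print(total_reward(format_reward=1.0, answer_reward=1.0,
                       fitness=0.8, precision=0.6))
```

Because the conformance term is a harmonic mean, a response scores well only when both fitness and precision are high; a high value on one cannot compensate for a low value on the other.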

Abstract

Reinforcement learning (RL)-based post-training has been crucial for enabling multi-step reasoning in large reasoning models (LRMs), yet current reward schemes are typically outcome-centric. We propose PM4GRPO, a reasoning-aware Group Relative Policy Optimization (GRPO) that augments standard answer/format rewards with signals over the reasoning procedure. To this end, process mining techniques are utilized to compute a scalar conformance reward that measures how closely a policy model's reasoning aligns with that of the pretrained teacher model. The empirical results on five benchmarks demonstrate that PM4GRPO significantly outperforms existing methodologies for GRPO-based post-training. These results highlight that leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.

Paper Structure

This paper contains 10 sections, 6 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Illustration of the Reasoning-Aware GRPO using Process Mining.