CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts

Jiuheng Lin, Cong Jiang, Zirui Wu, Jiarui Sun, Yansong Feng

TL;DR

The paper tackles the problem that outcome-focused RL on MCQs can boost final accuracy but degrade reasoning quality, especially consistency, in domains with scarce data. It introduces CLARity, a cost-efficient RL framework that uses a small general-purpose LLM as a consistency reward model within a two-stage refine-then-monitor training loop, complemented by a dynamic data reformulation strategy. CLARity improves consistency by 16.5% and accuracy by 7.5% over baselines in legal and medical domains, with human evaluators noting improvements in coherence, professionalism, and readability. Crucially, it eliminates the need for large PRMs or domain-specific expert data, offering a generalizable pathway for small LLMs to guide expert models toward higher reasoning quality and accuracy.

Abstract

Training expert LLMs in domains with scarce data is difficult, often relying on multiple-choice questions (MCQs). However, standard outcome-based reinforcement learning (RL) on MCQs is risky. While it may improve accuracy, we observe it often degrades reasoning quality such as logical consistency. Existing solutions to supervise reasoning, such as large-scale Process Reward Models (PRMs), are prohibitively expensive. To address this, we propose CLARity, a cost-effective RL framework that enhances reasoning quality using only a small, general-purpose LLM. CLARity integrates a consistency-aware reward mechanism with a 2-stage refine-then-monitor training pipeline to enhance reasoning consistency, and a dynamic data reformulation strategy to better exploit limited data. Experiments demonstrate that CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines. Human evaluations further confirm holistic improvements in coherence and professionalism. Thus, CLARity offers a generalizable solution that enables smaller models to effectively guide expert models through reasoning consistency. Our code is open-sourced at: https://github.com/Infinite-set/CLARity
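
The abstract does not state the reward formula, so the following is only an illustrative sketch of how an outcome reward might be combined with a consistency signal and then turned into GRPO-style group-relative advantages (GRPO is the training method named in the Figure 2 caption). The function names, the `penalty` value, and the use of a population standard deviation are all assumptions made for illustration, not the authors' implementation; in practice `is_consistent` would come from the small general-purpose LLM acting as the consistency judge.

```python
from statistics import mean, pstdev


def combined_reward(answer_correct: bool, is_consistent: bool,
                    penalty: float = 0.5) -> float:
    """Hypothetical reward shaping (not the paper's formula).

    Start from a 0/1 outcome reward and subtract a fixed penalty when the
    consistency judge flags the reasoning as inconsistent with the answer.
    """
    outcome = 1.0 if answer_correct else 0.0
    return outcome if is_consistent else outcome - penalty


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-6) -> list[float]:
    """GRPO-style normalization over one group of sampled responses.

    Each reward is centered on the group mean and scaled by the group
    standard deviation (population std assumed here); eps avoids division
    by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one MCQ: (answer_correct, is_consistent)
samples = [(True, True), (True, False), (False, True), (False, False)]
rewards = [combined_reward(c, k) for c, k in samples]
advantages = group_relative_advantages(rewards)
```

Under this assumed shaping, a correct answer backed by inconsistent reasoning receives a smaller advantage than a correct and consistent one, which is the kind of pressure a consistency-aware reward is meant to add on top of answer correctness.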

Paper Structure

This paper contains 48 sections, 8 figures, 17 tables.

Figures (8)

  • Figure 1: Illustration of risks in MCQ RL: rewarding only answer correctness neglects reasoning supervision, which may weaken reasoning quality during training.
  • Figure 2: Response quality dynamics under GRPO training. The logical consistency declines over time.
  • Figure 3: Overview of CLARity, an efficient MCQ RL framework that trains expert models using only small, general-purpose LLMs. It combines a consistency mechanism for detecting inconsistencies, a refine-then-monitor training pipeline for improving reasoning quality, and a dynamic data reformulation for maximizing data utility.
  • Figure 4: Training dynamics of three inconsistency types on Jec-QA validation set.
  • Figure 5: Reasoning quality comparison between CLARity and different baselines: without Stage-1, using Qwen-1.5B as the consistency reward model, and the vanilla Qwen2.5-7B-Instruct.
  • ...and 3 more figures