CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

Zijun Gao, Zhikun Xu, Xiao Ye, Ben Zhou

TL;DR

CORE identifies and addresses the gap between recalling mathematical concepts and applying them in reasoning. It curates a textbook-based concept–exercise corpus, diagnoses non-conceptual reasoning with Concept Probes, and applies three concept-aware training recipes (CORE-Base, CORE-CR, CORE-KL) to reinforce concept-driven trajectories. Across multiple base and instruction-tuned models and diverse benchmarks, CORE yields consistent improvements in concept selection, application, and robustness without architectural changes. The work demonstrates that explicit grounding in concepts via RL supervision can significantly deepen mathematical reasoning and generalize beyond in-domain data.

Abstract

Large language models (LLMs) often solve challenging math exercises yet fail to apply the relevant concept correctly when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than conceptual application. We introduce CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing LLMs can restate definitions but fail concept-linked quizzes, quantifying the conceptual reasoning gap. CORE then (i) synthesizes concept-aligned quizzes, (ii) injects brief concept snippets during rollouts to elicit concept-primed trajectories, and (iii) reinforces conceptual reasoning via trajectory replacement after group failures, a lightweight forward-KL constraint that aligns unguided with concept-primed policies, or standard GRPO directly on concept-aligned quizzes. Across several models, CORE delivers consistent gains over vanilla and SFT baselines on both in-domain concept-exercise suites and diverse out-of-domain math benchmarks. CORE unifies direct training on concept-aligned quizzes and concept-injected rollouts under outcome regularization. It provides fine-grained conceptual supervision that bridges problem-solving competence and genuine conceptual reasoning, while remaining algorithm- and verifier-agnostic.
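
To illustrate why "trajectory replacement after group failures" can recover a learning signal, here is a minimal sketch. It assumes binary verifier rewards and GRPO-style group normalization of rewards. The function names `group_advantages` and `replace_on_group_failure`, the `(text, reward)` rollout format, and the choice to overwrite a single slot are all hypothetical illustrations, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def replace_on_group_failure(unguided, primed, num_replace=1):
    """Swap concept-primed successes into a group whose rollouts all failed.

    Each rollout is a (text, reward) pair. The replacement count and the
    choice of which slots to overwrite are illustrative assumptions.
    """
    if any(reward > 0 for _, reward in unguided):
        return list(unguided)  # group already carries a learning signal
    successes = [r for r in primed if r[1] > 0][:num_replace]
    return successes + list(unguided[len(successes):])


# Hypothetical group of four unguided rollouts that all fail the verifier.
unguided = [("attempt a", 0.0), ("attempt b", 0.0),
            ("attempt c", 0.0), ("attempt d", 0.0)]
# Hypothetical rollouts generated with a concept snippet in the prompt.
primed = [("concept-primed attempt", 1.0), ("concept-primed miss", 0.0)]

before = group_advantages([r for _, r in unguided])
mixed = replace_on_group_failure(unguided, primed)
after = group_advantages([r for _, r in mixed])

print(before)  # every advantage is zero, so the group yields no gradient
print(after)   # the replaced slot is positive, the remaining slots negative
```

When every rollout in a group receives the same reward, the normalized advantages are all zero and the group contributes nothing to the policy update. Substituting even one successful concept-primed trajectory restores a non-zero spread.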

Paper Structure

This paper contains 44 sections, 6 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: An example of ChatGPT-4o’s superficial understanding of the Rational Root Theorem. Please read from left to right.
  • Figure 2: Overview of the Concept-Oriented Reinforcement (CORE) framework. For a given query, the policy model generates multiple candidate solutions. If any solution is correct, CORE-Base proceeds. When all solutions fail, CORE activates concept-guided correction: the Concept Recall module retrieves relevant domain knowledge, and Concept Injection re-prompts the model with this guidance to form corrected trajectories. CORE-CR replaces failed paths with these concept-grounded ones to recover the learning signal, while CORE-KL distills the concept-enhanced reasoning back into the base policy via a forward KL loss, strengthening conceptual consistency and generalization.
  • Figure 3: Performance comparison on Common vs Individual metrics.
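
The forward KL loss mentioned in the Figure 2 caption can be sketched for a single next-token distribution. This is an illustration under stated assumptions: the concept-primed policy is treated as the fixed reference (teacher) and the unguided policy as the student, so the quantity is KL(primed || unguided). The direction, the per-token form, and the names `log_softmax` and `forward_kl` are assumptions for illustration; the paper's equations define the actual objective.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def forward_kl(primed_logits, unguided_logits):
    """KL(primed || unguided) = sum_i p_i * (log p_i - log q_i).

    p comes from the concept-primed policy (assumed teacher), q from the
    unguided policy (assumed student). Both cover the same vocabulary.
    """
    log_p = log_softmax(primed_logits)
    log_q = log_softmax(unguided_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical logits over a three-token vocabulary.
primed = [2.0, 0.0, 0.0]    # concept-primed policy favors the first token
unguided = [0.0, 0.0, 0.0]  # unguided policy is uniform

print(forward_kl(primed, unguided))  # positive: the two policies disagree
print(forward_kl(primed, primed))    # zero up to rounding: identical policies
```

Forward KL is weighted by the reference distribution, so it penalizes the student most where the concept-primed policy places probability mass that the unguided policy lacks. It is also asymmetric: swapping the arguments gives a different value.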