Pedagogy-R1: Pedagogically-Aligned Reasoning Model with Balanced Educational Benchmark

Unggi Lee, Jaeyong Lee, Jiyeong Bae, Yeil Jeong, Junbo Koh, Gyeonggeon Lee, Gunho Lee, Taekyung Ahn, Hyeoncheol Kim

TL;DR

Pedagogy-R1 addresses the gap between the strong reasoning of large reasoning models (LRMs) and the need for pedagogically coherent teaching behavior. It introduces a distillation-based training pipeline; the Well-balanced Educational Benchmark (WBEB), which spans subject knowledge (SK), pedagogical knowledge (PK), knowledge tracing (KT), automated essay scoring (AES), and teacher decision-making (DM); and the Chain-of-Pedagogy (CoP) prompting strategy to elicit teacher-like reasoning. Empirical results show Pedagogy-R1 achieves more balanced and educationally aligned performance than standard baselines, with notable gains in pedagogical knowledge, knowledge tracing, and instructional decision-making, while preserving reasonable subject knowledge. The work offers practical implications for deploying LRMs in classrooms and educational platforms, supported by open datasets and a mixed-method evaluation that combines quantitative metrics with grounded theory–based qualitative analysis.

Abstract

Recent advances in large reasoning models (LRMs) show strong performance in structured domains such as mathematics and programming; however, they often lack pedagogical coherence and realistic teaching behaviors. To bridge this gap, we introduce Pedagogy-R1, a framework that adapts LRMs for classroom use through three innovations: (1) a distillation-based pipeline that filters and refines model outputs for instruction-tuning, (2) the Well-balanced Educational Benchmark (WBEB), which evaluates performance across subject knowledge, pedagogical knowledge, knowledge tracing, essay scoring, and teacher decision-making, and (3) a Chain-of-Pedagogy (CoP) prompting strategy for generating and eliciting teacher-style reasoning. Our mixed-method evaluation combines quantitative metrics with qualitative analysis, providing the first systematic assessment of LRMs' pedagogical strengths and limitations.

Paper Structure

This paper contains 36 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Radar chart illustrating the overall performance of our pedagogical reasoning model and baselines across all educational domains in our benchmark. The chart demonstrates the balanced and robust capabilities of our approach.
  • Figure 2: Overview of the development pipeline for the Well-balanced Educational Benchmark (WBEB) and Pedagogy-R1. The figure illustrates three key stages: (1) construction of a comprehensive educational benchmark through data collection, LLM translation, and human curation; (2) pedagogical reasoning via distillation and Chain-of-Pedagogy (CoP) prompting to train teacher-like models; and (3) mixed-method analysis combining quantitative and qualitative evaluation of pedagogical reasoning.
  • Figure 3: Prompt refinement from general step-by-step reasoning to pedagogically guided reasoning.
  • Figure 4: Quantitative analyses of domain-specific data. Left shows reasoning token amounts, and Right shows UT scores and transition word ratios.
  • Figure 5: Quantified qualitative analyses of domain-specific data. Left shows code type ratios by domain, and Right shows code type ratios with CoP prompting.