Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion

Zhen Tan; Chengshuai Zhao; Song Wang; Jundong Li; Tianlong Chen; Huan Liu

Abstract

Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted "explanatory probes" that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. Second, to improve generalization, Explanatory GRPO (ExGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average 20.39% increase over zero-shot performance and a 6.02% improvement over the state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with 10-25% of the training data) and strong generalization to out-of-distribution tasks. Implementation is released at https://github.com/Zhen-Tan-dmml/ExGRPO.git.

Paper Structure (44 sections, 1 theorem, 23 equations, 6 figures, 5 tables)

Introduction
Related Work
Reinforcement Distillation via Explanatory Inversion
Explanatory Inversion: Crafting Probes for Deeper Reasoning
Stage 1: Data Curation
Stage 2: Supervised Fine-Tuning for Cold Start
Stage 3: Reinforcement Distillation via Explanatory GRPO (ExGRPO).
Interaction Protocol with Randomized Explanatory Probe Sampling
Rule-Based Reward Design for ExGRPO
Advantage Computation and Policy Update
Imitation-Based Policy Regularization:
Inference Procedure
Experiments
Experimental Setup
Main Results: Overall Performance
...and 29 more sections

Key Result

Theorem 3.1

Let $\pi_k$ and $\pi_{k'}$ be student policies trained with full ($k$-turn) and partial ($k' < k$) explanatory probe sequences, respectively. If the utility bonus $r_{\text{dsu}}$ is applied only when [...], then the ExGRPO policy update with clipped importance sampling ensures [...], with strict inequality if $r_{\text{dsu}} > 0$ for any training instance.
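
The ingredients named in the theorem (a conditionally applied utility bonus, group-relative advantages, and a clipped importance-sampling update) can be illustrated with a minimal Python sketch. This is not the released implementation. The gating rule used here (bonus granted only when every probe turn is answered correctly), the bonus value, the clip range, and all function names are assumptions made for illustration, since the theorem's exact condition is not reproduced on this page.

```python
import math
from statistics import mean, pstdev


def total_reward(base_reward, probes_correct, bonus=0.5):
    """Base reward plus a utility bonus.

    ASSUMPTION: the bonus is granted only when every probe turn is correct.
    The actual gating condition for r_dsu is not given in this summary.
    """
    return base_reward + (bonus if all(probes_correct) else 0.0)


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped objective for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # importance-sampling ratio
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Three hypothetical rollouts for the same question.
rewards = [
    total_reward(1.0, [True, True]),    # correct answer, all probes passed
    total_reward(1.0, [True, False]),   # correct answer, one probe failed
    total_reward(0.0, [False, False]),  # wrong answer
]
advantages = group_advantages(rewards)
```

Under these assumptions, the rollout that passes every probe receives the largest advantage in its group, which is the mechanism by which a bonus of this kind would favor full probe sequences over partial ones.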

Figures (6)

Figure 1: (a) Distilled LLMs often exhibit generalization limitations compared to teacher models (e.g., Gemini-1.5-Pro vs. smaller distilled models on the Test vs. EI-Test sets, where EI-Test is the test set augmented using Gemini-1.5-Pro and Explanatory Inversion (EI)). See the appendix for more experimental details. (b) This is exemplified by the reversal curse, where a model correctly solves a forward problem (e.g., 5-2=3) but fails its inverse. (c) Prior "Reverse Thinking" approaches, like RevThink (Chen et al., 2025), attempt A-to-Q reasoning. (d) Our ExGRPO method enhances distillation by using EI probes to challenge and refine student models via RL.
Figure 2: ExGRPO framework overview. The student model learns from multi-turn explanatory probe dialogues.
Figure 3: RL training curves with Gemma as student. Evolution of key reward components during ExGRPO training. Left: $R_{\text{base}}$. Right: $r_{\text{dsu}}$, scaled for visualization.
Figure 4: Sample efficiency comparison on eight datasets. Our ExGRPO method achieves higher accuracy than standard SFT across all training data fractions ($p\in \{0.1,0.25,0.5,1.0\}$), often surpassing SFT trained on the full dataset with only $10$-$25\%$ of the data.
Figure 5: Average accuracy vs. average training token count. The dashed line shows the regression over the baselines.
...and 1 more figures

Theorems & Definitions (2)

Theorem 3.1
proof
