Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, Alan Schelten

TL;DR

This work introduces Compute as Teacher (CaT), a framework that enables RL in settings lacking ground-truth references by turning inference compute into supervision. It combines reference estimation (e.g., synthesis), which produces a pseudo-reference, with reward derivation (verifiable or rubric-based), which guides training, so that parallel rollouts inform learning without human labels. The approach is validated on HealthBench (non-verifiable) and MATH-500 (verifiable): CaT matches or exceeds inference-time aggregation while using up to 9× less test-time compute, and achieves up to 30% relative improvement over the initial policy in non-verifiable domains. A key innovation is self-proposed rubrics for non-verifiable tasks: scored by a judge, they provide stable, auditable rewards without human annotations. Across three model families, CaT acts as a drop-in mechanism that extends RL to domains where no programmatic checker is available.

Abstract

Where do learning signals come from when there is no ground truth in post-training? We show that inference compute itself can serve as supervision. By generating parallel rollouts and converting them into reference estimates, models can learn without human labels, critically even in non-verifiable domains like healthcare guidance, where no programmatic checker exists. We call this framework Compute as Teacher (CaT): it turns inference-time compute from parallel rollouts into supervision for RL training. The framework has two components: (1) reference estimation, which aggregates rollouts into a pseudo-reference answer, and (2) reward derivation, which converts that pseudo-reference into RL rewards. For (1), we explore a simple method we call synthesis, but the framework admits any aggregator. For (2), we introduce self-proposed rubrics for non-verifiable domains. These are binary, auditable criteria generated from the pseudo-reference and scored by an LLM judge. On HealthBench, models trained with CaT match or exceed inference-time aggregation quality while using 9× less test-time compute. Here, CaT also competes with learning from expert physician annotations, yielding up to +30% relative improvement over the initial policy. The framework extends naturally to verifiable rewards, matching the best existing baselines on MATH-500 in test-time RL and demonstrating 'drop-in' versatility across both types of domains.
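The two components described above can be illustrated with a minimal sketch for the verifiable case. It assumes majority vote as the aggregator (one of the reference estimators the framework admits, used here as a stand-in for synthesis, which would require a model call) and exact answer matching as the reward. The function names `majority_vote`, `answer_match_rewards`, and `cat_rewards` are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Reference estimation (assumed aggregator): most common final answer."""
    return Counter(answers).most_common(1)[0][0]


def answer_match_rewards(answers: List[str], reference: str) -> List[float]:
    """Reward derivation for verifiable domains: 1.0 if a rollout's answer
    matches the pseudo-reference, else 0.0."""
    return [1.0 if answer == reference else 0.0 for answer in answers]


def cat_rewards(
    answers: List[str],
    estimate_reference: Callable[[List[str]], str] = majority_vote,
) -> Tuple[str, List[float]]:
    """Turn a group of parallel rollouts into (pseudo-reference, rewards).
    No ground-truth label is consulted at any point."""
    reference = estimate_reference(answers)
    return reference, answer_match_rewards(answers, reference)


# Final answers extracted from four hypothetical rollouts for one prompt.
rollouts = ["42", "41", "42", "42"]
reference, rewards = cat_rewards(rollouts)
print(reference, rewards)
```

The resulting per-rollout rewards would then feed an RL update on the policy. Swapping `estimate_reference` for another aggregator leaves the rest of the loop unchanged, which is the sense in which the framework "admits any aggregator".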

Paper Structure

This paper contains 67 sections, 12 equations, 6 figures, 12 tables, 1 algorithm.

Figures (6)

  • Figure 1: CaT framework. The policy $\pi_t$ generates $G$ rollouts for prompt $q$. Reference estimation (synthesis, majority vote, or best-of-N) aggregates them into pseudo-reference $s$. Reward derivation scores each rollout against $s$: answer matching for verifiable domains, self-proposed rubrics for non-verifiable. Rewards drive RL, updating the policy.
  • Figure 2: Rubric-based rewards for non-verifiable domains. The anchor $\pi_0$ generates binary (yes/no) criteria $\mathcal{R} = \{r_1, \ldots, r_n\}$ from pseudo-reference $s$. The judge $\pi_J$ scores rollout $o_i$ against each criterion; the reward is the fraction satisfied.
  • Figure 3: CaT improves models by up to 30% relative to the initial policy. CaT matches or exceeds inference-time synthesis while using $9\times$ less test-time compute. CaT outperforms synthesis with Gemma and Llama on HealthBench ($p<.05$) with Welch's $t$-test.
  • Figure 4: Left: Self-proposed rubrics match physician-annotated rubrics and outperform model-as-judge ($p<.05$). Right: RL with rubric rewards outperforms SFT on synthesized references. CaT outperforms CaT-SFT in all cases ($p<.05$). Significance with Welch's $t$-test.
  • Figure 5: Synthesis exceeds selection baselines at inference time in non-verifiable domains (HealthBench). Synthesis is also competitive with the best methods on verifiable domains (MATH-500), enabling 'drop-in' use across domains. Percentage improvement is relative to a single sample. Synthesis gains over the next best method on HealthBench are all statistically significant ($p<.05$) using Welch's $t$-test.
  • ...and 1 more figure
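The rubric-based reward in the Figure 2 caption (the fraction of binary criteria in $\mathcal{R}$ that rollout $o_i$ satisfies) can be sketched as follows. This is an illustration under stated assumptions: the judge $\pi_J$ is modelled as any callable returning a yes/no verdict, and `keyword_judge` is a toy stand-in for an LLM judge, not the paper's judge. All names are hypothetical.

```python
from typing import Callable, List

# A judge takes (rollout, criterion) and returns True for "yes", False for "no".
Judge = Callable[[str, str], bool]


def rubric_reward(rollout: str, rubric: List[str], judge: Judge) -> float:
    """Reward = fraction of binary rubric criteria the rollout satisfies."""
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    satisfied = sum(1 for criterion in rubric if judge(rollout, criterion))
    return satisfied / len(rubric)


def keyword_judge(rollout: str, criterion: str) -> bool:
    """Toy stand-in for an LLM judge (assumption): a criterion counts as
    satisfied if its text appears verbatim in the rollout."""
    return criterion.lower() in rollout.lower()


# Hypothetical rubric; in CaT the criteria are generated from the pseudo-reference.
rubric = ["see a doctor", "hydration", "dosage"]
rollout = "Stay on top of hydration and see a doctor if symptoms persist."
reward = rubric_reward(rollout, rubric, keyword_judge)
print(round(reward, 3))
```

Because each criterion is scored independently as yes or no, every verdict can be inspected individually, which is what makes the reward auditable.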