GATES: Self-Distillation under Privileged Context with Consensus Gating

Alex Stein, Furong Huang, Tom Goldstein

TL;DR

GATES addresses document-grounded question answering with asymmetric context, where a single model serves as both tutor and student. Supervision is derived online from tutor consensus: multiple document-grounded reasoning traces are sampled, and agreement among them gates learning.

Abstract

We study self-distillation in settings where supervision is unreliable: there are no ground-truth labels, verifiable rewards, or external graders to evaluate answers. We focus on document-grounded question answering with asymmetric context, where a single model serves as both tutor (with access to a relevant source document during training) and student (answering from the question alone at test time). Rather than assuming tutor correctness, we derive supervision online from tutor consensus by sampling multiple document-grounded reasoning traces and using agreement to gate learning. Conditioned on this reliability signal, we distill knowledge through full tutor reasoning trajectories (not just final answers), providing a dense and stable learning signal. Empirically, this consensus-gated trajectory distillation substantially improves transfer to the document-free student. Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
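
The consensus gate described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `consensus_gate`, the `min_agreement` threshold, and exact string matching of final answers are assumptions made for illustration. The source says only that agreement among sampled tutor rollouts gates learning.

```python
from collections import Counter
from typing import List, Optional, Tuple


def consensus_gate(
    answers: List[str], min_agreement: float = 0.5
) -> Tuple[Optional[str], List[int]]:
    """Gate a question on tutor agreement (illustrative sketch).

    answers: final answers extracted from the tutor rollouts for one question.
    min_agreement: hypothetical threshold on the fraction of rollouts that
        must share the most common answer.

    Returns (consensus_answer, indices_of_matching_rollouts), or (None, [])
    when the question is treated as unreliable and skipped.
    """
    if not answers:
        return None, []

    # Question-level gate: is the most common answer frequent enough?
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count / len(answers) < min_agreement:
        return None, []

    # Rollout-level gate: keep only trajectories matching the consensus.
    keep = [i for i, a in enumerate(answers) if a == top_answer]
    return top_answer, keep
```

For example, with four rollouts answering `["4", "4", "5", "4"]` and `min_agreement=0.75`, the question passes the gate and rollouts 0, 1, and 3 are retained for distillation. Four rollouts with four different answers are rejected.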

Paper Structure (50 sections, 7 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Overview of GATES. A tutor model, given a privileged document and question, generates multiple reasoning rollouts. A consensus gate filters rollouts based on answer agreement, discarding trajectories with minority answers. The surviving rollouts are used to train a student model, which receives only the question, via a distillation loss, transferring the tutor's privileged reasoning without requiring ground-truth labels. The tutor and student share the same underlying model, differing only in whether the privileged document is included in the input context.
  • Figure 2: Off-policy distillation in GATES. A single model $\pi_\theta$ operates in two roles under asymmetric context: as a tutor conditioned on both the source document $d$ and question $q$, and as a student conditioned on $q$ alone. (A) The tutor generates $k$ independent reasoning rollouts per question. (B) A question-level consensus gate labels the question as reliable if sufficiently many rollouts agree on the same answer; a second, rollout-level gate retains only trajectories that match the consensus. (C) Eligible tutor trajectories provide dense token-level supervision to the document-free student via trajectory distillation (Eq. 1). Unreliable questions are skipped entirely, preventing self-reinforcement collapse.
  • Figure 3: On-policy distillation in GATES. As in the off-policy variant (Figure 2), a single model $\pi_\theta$ serves as both tutor and student under asymmetric context. (A) Both roles generate $k$ rollouts in parallel: tutor rollouts establish consensus, while student rollouts provide the on-policy training signal. (B) A question-level consensus gate determines reliability; unlike the off-policy setting, there is no trajectory-level filter, so when consensus is strong, all student rollouts pass through. (C) The tutor scores each token of the student's own rollouts by computing log-probabilities under document context. The per-token advantage $A_t = \mathrm{clip}(\log \pi_{\mathrm{tutor}} - \log \pi_{\mathrm{student}},\, [-a,a])$ upweights tokens where the tutor assigns higher probability, encouraging document-grounded reasoning while remaining on-policy (Eq. 3). Unreliable questions contribute zero loss.
  • Figure 4: Main results comparing GATES against baselines. (a) Accuracy (%) on four document-free math benchmarks (maj@8 decoding). (b) Student accuracy on the held-out asymmetric evaluation (50 questions, greedy decoding). GATES yields the best student accuracy (62%), improving 16 percentage points over the pretrained base model. See the additional-evaluation appendix for tutor accuracy, greedy decoding, and coverage results.
  • Figure 5: Ablation results varying loss weights with $\lambda_{\text{KL}} = 0.02$ fixed throughout. (a) Accuracy (%) on document-free math benchmarks (maj@8 decoding). GATES (ours) is the canonical configuration. $-$ Gate uses the same loss weights but removes consensus gating, which lowers benchmark accuracy by $4.3$ pp despite the identical loss configuration. Adding oracle loss provides no meaningful improvement (35.7 vs. 35.4), confirming that consensus gating alone is sufficient without verified correctness labels. (b) Student accuracy on the held-out asymmetric evaluation (greedy decoding). Off-policy distillation remains the dominant contributor: configurations without it show the largest drops in student accuracy. Tutor accuracy is reported in the additional-evaluation appendix.
  • ...and 7 more figures
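
The per-token advantage in the Figure 3 caption, $A_t = \mathrm{clip}(\log \pi_{\mathrm{tutor}} - \log \pi_{\mathrm{student}},\, [-a,a])$, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the function name `clipped_advantages`, the default `a=1.0`, and zeroing the advantages of unreliable questions to model "zero loss" are all hypothetical choices.

```python
from typing import List


def clipped_advantages(
    tutor_logprobs: List[float],
    student_logprobs: List[float],
    a: float = 1.0,
    reliable: bool = True,
) -> List[float]:
    """Per-token advantage A_t = clip(log pi_tutor - log pi_student, [-a, a]).

    tutor_logprobs: log-probabilities of the student's sampled tokens, scored
        by the tutor (the same model, with the document in context).
    student_logprobs: log-probabilities of the same tokens under the
        document-free student.
    a: clipping bound (the default here is an assumed value).
    reliable: outcome of the question-level consensus gate; in this sketch,
        unreliable questions get all-zero advantages.
    """
    if len(tutor_logprobs) != len(student_logprobs):
        raise ValueError("log-probability sequences must have equal length")
    if not reliable:
        return [0.0] * len(tutor_logprobs)
    # Positive values mark tokens the tutor finds more likely than the student.
    return [max(-a, min(a, t - s)) for t, s in zip(tutor_logprobs, student_logprobs)]
```

With tutor log-probabilities `[-0.5, -3.0, -0.1]`, student log-probabilities `[-1.0, -0.5, -2.5]`, and `a=1.0`, the raw differences `0.5`, `-2.5`, and `2.4` are clipped to `[0.5, -1.0, 1.0]`.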