Who Taught You That? Tracing Teachers in Model Distillation

Somin Wadhwa, Chantal Shaib, Silvio Amir, Byron C. Wallace

TL;DR

This work formalizes teacher attribution in model distillation as a closed-set problem: given a trained student $m$ and a set of candidate teachers $\mathcal{M} = \{M_1, \dots, M_T\}$, identify which teacher the student was distilled from, with only black-box access to the candidates. It systematically compares perplexity, text similarity, and syntactic features for attribution, finding that perplexity and similarity signals are unreliable, while higher-order lexical patterns, specifically Part-of-Speech (PoS) templates, provide stronger discriminative power across summarization, QA, and instruction-following tasks. PoS templates yield higher accuracy (e.g., PubMed $0.68$, CommonsenseQA $0.67$) than $n$-grams or BoW in most cases; Alpaca is a modest exception, where $n$-grams can outperform PoS. The work contributes to transparency and compliance by outlining a practical footprint-based approach to attributing distilled capabilities, and it motivates future research into combining multiple signals and extending attribution to open-set, privacy-conscious scenarios. Experiments span diverse datasets, including CNN-DailyMail, SumPubMed, Rotten Tomatoes, OpenBookQA, and CommonsenseQA.

Abstract

Model distillation -- using outputs from a large teacher model to teach a small student model -- is a practical means of creating efficient models for a particular task. We ask: Can we identify a student's teacher based on its outputs? Such "footprints" left by teacher LLMs would be interesting artifacts. Beyond this, reliable teacher inference may have practical implications as actors seek to distill specific capabilities of massive proprietary LLMs into deployed smaller LMs, potentially violating terms of service. We consider practical task distillation targets including summarization, question answering, and instruction-following. We assume a finite set of candidate teacher models, which we treat as black boxes. We design discriminative models that operate over lexical features. We find that $n$-gram similarity alone is unreliable for identifying teachers, but part-of-speech (PoS) templates preferred by student models mimic those of their teachers.
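To make the idea of PoS templates as a footprint concrete, here is a minimal sketch, not the authors' pipeline. It assumes outputs have already been PoS-tagged by some external tagger, treats a "template" as a length-`n` run of tags, and swaps the paper's trained discriminative model for a simple nearest-profile rule based on cosine similarity. The function names, tag sequences, and teacher names are all hypothetical.

```python
from collections import Counter
from math import sqrt


def pos_templates(tag_sequences, n=4):
    """Count length-n PoS tag n-grams ("templates") over pre-tagged outputs."""
    counts = Counter()
    for tags in tag_sequences:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute_teacher(student_tags, teacher_tags_by_name, n=4):
    """Pick the candidate whose template profile is closest to the student's.

    Assumption: nearest-profile matching stands in for the trained
    classifier described in the abstract.
    """
    student_profile = pos_templates(student_tags, n)
    scores = {
        name: cosine(student_profile, pos_templates(tags, n))
        for name, tags in teacher_tags_by_name.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, made-up tag sequences (not data from the paper).
student = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "ADJ"],
]
teachers = {
    "teacher_a": [
        ["DET", "NOUN", "VERB", "DET", "NOUN"],
        ["DET", "NOUN", "VERB", "ADV"],
    ],
    "teacher_b": [
        ["PRON", "VERB", "ADP", "DET", "NOUN"],
        ["ADV", "PRON", "VERB", "NOUN"],
    ],
}
best, scores = attribute_teacher(student, teachers, n=3)
print(best, scores)
```

In practice the per-template counts would more likely be fed to a classifier as features rather than compared directly; the direct comparison is used here only to keep the sketch dependency-free.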

Paper Structure

This paper contains 20 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: We introduce the problem of teacher model attribution: Given a distilled student model (e.g., a fine-tuned GPT-2), determine which of a set of possible teacher models was distilled (here, Mistral).
  • Figure 2: Perplexity under teacher models of texts generated by different students on (a) Rotten-Tomatoes, (b) QuaRel, and (c) OpenBookQA. Teacher perplexity does not consistently identify the teacher.
  • Figure 3: AUC-ROC curves for a one-vs-rest logistic regression (LR) classifier using similarity score as the sole feature. Performance across models is close to random (AUC $\approx$ 0.49–0.53), indicating limited discriminative power.
  • Figure 4: Influence of teacher models on student outputs, highlighting the retention of Part-of-Speech (PoS) templates. The color-coded PoS sequences illustrate how students inherit structural patterns from their respective teachers, suggesting that syntactic characteristics are preserved to some extent during knowledge transfer. This pattern indicates that PoS templates can serve as a distinguishing feature in identifying which teacher model was used to train a given student.
  • Figure 5: Average cosine similarity of student outputs (across all student models) over Bag-of-Words features with their true teachers vs. other teachers; these features provide little in the way of signal about teachers.
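
The one-vs-rest evaluation described in the Figure 3 caption can be approximated without fitting a model. With a single feature, a logistic regression's predicted probability is monotone in that feature, so its ROC AUC equals the AUC of the raw score when the fitted coefficient is positive (and one minus that value when it is negative). The sketch below computes that rank-based AUC per candidate teacher. The function names, the input layout, and the numbers are hypothetical, not taken from the paper.

```python
def auc_from_scores(scores, labels):
    """ROC AUC as the chance a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(similarity, true_teacher, teachers):
    """Per-teacher AUC with the similarity score as the only feature.

    similarity[i][t] is the similarity of student output i to teacher t;
    true_teacher[i] names the teacher that student was distilled from.
    """
    result = {}
    for t in teachers:
        scores = [row[t] for row in similarity]
        labels = [1 if y == t else 0 for y in true_teacher]
        result[t] = auc_from_scores(scores, labels)
    return result


# Toy, made-up similarity scores (not results from the paper).
similarity = [
    {"teacher_a": 0.6, "teacher_b": 0.5},
    {"teacher_a": 0.4, "teacher_b": 0.7},
    {"teacher_a": 0.5, "teacher_b": 0.5},
]
true_teacher = ["teacher_a", "teacher_b", "teacher_a"]
print(one_vs_rest_auc(similarity, true_teacher, ["teacher_a", "teacher_b"]))
```

An AUC near 0.5 under this computation means the similarity score ranks true-teacher pairs no better than chance, which is the situation the caption reports.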