Who Taught You That? Tracing Teachers in Model Distillation
Somin Wadhwa, Chantal Shaib, Silvio Amir, Byron C. Wallace
TL;DR
This work formalizes teacher attribution in model distillation as a closed-set problem: given a trained student $m$ and a set of candidate teachers $\mathcal{M} = \{M_1, \dots, M_T\}$, identify which teacher's outputs were used to distill $m$, with only black-box access to the candidates. It systematically compares perplexity, text similarity, and syntactic features as attribution signals, finding that perplexity and similarity are unreliable, while higher-order lexical patterns, specifically Part-of-Speech (PoS) templates, are more discriminative across summarization, QA, and instruction-following tasks. PoS templates yield higher accuracy (e.g., $0.68$ on PubMed and $0.67$ on CommonsenseQA) than $n$-grams or bag-of-words (BoW) features in most cases; Alpaca is a modest exception, where $n$-grams can outperform PoS templates. The work supports transparency and compliance by outlining a practical footprint-based approach to attributing distilled capabilities, and it motivates future research on combining multiple signals and extending attribution to open-set, privacy-conscious scenarios. Evaluation covers datasets including CNN-DailyMail, SumPubMed, Rotten Tomatoes, OpenBookQA, and CommonsenseQA.
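One way to write the closed-set decision rule, offered as a sketch and not necessarily the authors' exact formulation, is as an argmax over the candidates. Here $s$ is a hypothetical scoring function (for example, a classifier's confidence that the student's outputs resemble those of $M_t$):

$$\hat{t} = \arg\max_{t \in \{1, \dots, T\}} s(m, M_t)$$

Under this reading, attribution is correct when $M_{\hat{t}}$ is the teacher that was actually used to distill $m$.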
Abstract
Model distillation -- using outputs from a large teacher model to teach a small student model -- is a practical means of creating efficient models for a particular task. We ask: Can we identify a student's teacher based on its outputs? Such "footprints" left by teacher LLMs would be interesting artifacts. Beyond this, reliable teacher inference may have practical implications as actors seek to distill specific capabilities of massive proprietary LLMs into deployed smaller LMs, potentially violating terms of service. We consider practical task distillation targets including summarization, question answering, and instruction-following. We assume a finite set of candidate teacher models, which we treat as black boxes. We design discriminative models that operate over lexical features. We find that $n$-gram similarity alone is unreliable for identifying teachers, but part-of-speech (PoS) templates preferred by student models mimic those of their teachers.
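To make the idea of PoS templates concrete, the following is a minimal sketch, not the authors' method. It assumes outputs have already been PoS-tagged by some external tagger, treats a "template" as a contiguous PoS $n$-gram, and swaps the paper's discriminative models for a simple cosine-similarity comparison of template counts. The function names, the tag sequences, and the teacher labels are all hypothetical.

```python
from collections import Counter
from math import sqrt


def pos_templates(tag_sequences, n=3):
    """Count contiguous PoS n-grams ("templates") across tagged outputs."""
    counts = Counter()
    for tags in tag_sequences:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute_teacher(student_tags, teacher_tags, n=3):
    """Return the candidate whose template profile is closest to the student's.

    This nearest-profile rule is an illustrative stand-in for a trained
    classifier over PoS-template features.
    """
    student_profile = pos_templates(student_tags, n)
    scores = {
        name: cosine(student_profile, pos_templates(seqs, n))
        for name, seqs in teacher_tags.items()
    }
    return max(scores, key=scores.get), scores


# Toy, hand-written tag sequences (hypothetical; real use would tag model outputs).
teacher_tags = {
    "teacher_a": [
        ["DET", "NOUN", "VERB", "DET", "NOUN"],
        ["DET", "ADJ", "NOUN", "VERB", "ADV"],
    ],
    "teacher_b": [
        ["VERB", "DET", "NOUN", "ADP", "NOUN"],
        ["VERB", "PRON", "ADP", "DET", "NOUN"],
    ],
}
student_tags = [["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]]

best, scores = attribute_teacher(student_tags, teacher_tags)
print(best, scores)
```

On this toy input the student shares three trigram templates with `teacher_a` and none with `teacher_b`, so the sketch selects `teacher_a`. With real data, the same template counts could instead be fed as features to a trained classifier, which is closer in spirit to the discriminative setup the abstract describes.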
