Label Forensics: Interpreting Hard Labels in Black-Box Text Classifier
Mengyao Du, Gang Yang, Han Fang, Quanjun Yin, Ee-chien Chang
TL;DR
Label Forensics provides a principled framework to recover the semantic meaning of hard-label outputs from black-box text classifiers. It combines WordNet-based initialization, a prompt-free prefix-tuned encoder–decoder sampler, and a prototype-scoring objective to construct a per-label semantic distribution that is both precise and general. The approach is validated across five public classifiers and a real undocumented HuggingFace model, achieving about 92% label consistency and enabling interpretable, auditable insights into deployed models. This work advances transparent AI by linking opaque outputs to semantically meaningful regions in embedding space, supporting responsible auditing of web-scale NLP systems.
Abstract
The widespread adoption of natural language processing techniques has led to an unprecedented growth of text classifiers across the modern web. Yet many of these models circulate with their internal semantics undocumented or even intentionally withheld. Such opaque classifiers, which may expose only hard-label outputs, can operate in unregulated web environments or be repurposed for unknown intents, raising legitimate forensic and auditing concerns. In this paper, we position ourselves as investigators and work to infer the semantic concept each label encodes in an undocumented black-box classifier. Specifically, we introduce label forensics, a black-box framework that reconstructs a label's semantic meaning. We represent a label by a sentence embedding distribution from which any sample reliably reflects the concept the classifier has implicitly learned for that label. We believe this distribution should have two key properties: it should be precise, with samples consistently classified into the target label, and general, covering the label's broad semantic space. To realize this, we design a semantic neighborhood sampler and an iterative optimization procedure that selects representative seed sentences to jointly maximize label consistency and distributional coverage. The final output, an optimized seed sentence set combined with the sampler, constitutes the empirical distribution representing the label's semantics. Experiments on multiple black-box classifiers achieve an average label consistency of 92.24%, indicating that the embedding regions accurately capture each classifier's label semantics. We further validate our framework on an undocumented HuggingFace classifier, enabling fine-grained label interpretation and supporting responsible AI auditing.
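The two properties named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions used in the paper, so everything below is an assumption for illustration: label consistency is taken to be the fraction of sampled sentences that a hard-label classifier assigns to the target label, and coverage is approximated by the mean pairwise cosine distance between sample embeddings. The names `label_consistency`, `mean_pairwise_cosine_distance`, and `toy_classifier` are hypothetical and do not come from the paper.

```python
import math
from typing import Callable, Sequence


def label_consistency(classify: Callable[[str], str],
                      samples: Sequence[str],
                      target_label: str) -> float:
    """Fraction of samples that the black-box classifier maps to target_label.

    Assumed definition; the classifier is queried only for its hard label.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    hits = sum(1 for s in samples if classify(s) == target_label)
    return hits / len(samples)


def mean_pairwise_cosine_distance(embeddings: Sequence[Sequence[float]]) -> float:
    """Crude coverage proxy (assumption): average cosine distance over all pairs.

    Expects non-zero vectors of equal length.
    """
    def cosine(u: Sequence[float], v: Sequence[float]) -> float:
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    n = len(embeddings)
    if n < 2:
        return 0.0
    total = sum(1.0 - cosine(embeddings[i], embeddings[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for an undocumented hard-label classifier."""
    return "LABEL_1" if "refund" in text.lower() else "LABEL_0"


samples = [
    "I want a refund",
    "Refund my order please",
    "Where is my parcel?",
    "Please refund me",
]
consistency = label_consistency(toy_classifier, samples, "LABEL_1")
print(consistency)
```

In practice `classify` would wrap a query to the real black-box model, and the embeddings would come from whichever sentence encoder the investigator chooses. A set of samples that scores high on the first measure but near zero on the second would be precise yet narrow, which is the failure mode the abstract's "general" property is meant to rule out.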
