Wrapper Boxes: Faithful Attribution of Model Predictions to Training Data
Yiheng Su, Junyi Jessy Li, Matthew Lease
TL;DR
Wrapper Boxes introduce a general pipeline that wraps neural encoders with transparent, example-based classifiers, delivering predictions that remain largely on par with neural baselines while enabling faithful attribution to training data. By training classic models on neural representations, the approach provides inherently interpretable explanations without retraining the underlying neural module, and supports data-centric AI practices and algorithmic recourse. Empirical results across two NLP tasks and multiple architectures (including zero-shot LLM representations) demonstrate competitive predictive performance and strong data-attribution capabilities, with the $k$-nearest-neighbor ($k$NN), decision tree (DT), and $L$-Means wrappers offering different tradeoffs in faithfulness, simplicity, and subset size $S_t$. The work discusses practical limitations, such as storage and computational costs, and emphasizes potential applications in model auditing and decision contestation, positioning wrapper boxes as a versatile, data-centric framework for interpretable NLP.
Abstract
Can we preserve the accuracy of neural models while also faithfully attributing model decisions to training data? We propose a "wrapper box" pipeline: training a neural model as usual and then using its learned feature representation in classic, interpretable models to perform prediction. Across seven language models of varying sizes, including four large language models (LLMs), two datasets at different scales, three classic models, and four evaluation metrics, we first show that the predictive performance of wrapper classic models is largely comparable to that of the original neural models. Because classic models are transparent, each model decision is determined by a known set of training examples that can be directly shown to users. Our pipeline thus preserves the predictive performance of neural language models while faithfully attributing classic model decisions to training data. Among other use cases, such attribution enables model decisions to be contested based on the responsible training instances. Compared to prior work, our approach achieves higher coverage and correctness in identifying which training data to remove to change a model decision. To support reproduction of our findings, our source code is available at: https://github.com/SamSoup/WrapperBox.
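To make the pipeline concrete, the following is a minimal sketch of one wrapper box, assuming a $k$NN classifier over embeddings that a frozen neural encoder has already produced. It is an illustration only, not the released implementation: the function name `knn_wrapper_predict`, the toy two-dimensional embeddings, and the labels are all hypothetical. The point it shows is that the prediction and its attribution come from the same object, namely the indices of the training examples that cast the votes.

```python
import math
from collections import Counter


def knn_wrapper_predict(train_embeddings, train_labels, query_embedding, k=3):
    """Hypothetical helper: classify a query with kNN over encoder embeddings.

    Returns the predicted label and the indices of the k training examples
    that determined it, ordered from nearest to farthest.
    """
    # Euclidean distance from the query to every training embedding.
    distances = [
        (math.dist(vec, query_embedding), idx)
        for idx, vec in enumerate(train_embeddings)
    ]
    # The k nearest training examples are the only ones that vote.
    neighbor_ids = [idx for _, idx in sorted(distances)[:k]]
    votes = Counter(train_labels[i] for i in neighbor_ids)
    predicted = votes.most_common(1)[0][0]
    return predicted, neighbor_ids


# Toy stand-ins for encoder outputs (assumed, not real model embeddings).
train_embeddings = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]]
train_labels = ["neg", "neg", "neg", "pos", "pos"]

label, support = knn_wrapper_predict(
    train_embeddings, train_labels, [0.95, 1.0], k=3
)
print(label, support)
```

In this sketch the returned indices can be mapped back to the original training texts and shown to a user as the basis for the decision. A decision tree or clustering-based wrapper would expose a different set of responsible examples, but the same idea applies.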
