
Wrapper Boxes: Faithful Attribution of Model Predictions to Training Data

Yiheng Su, Junyi Jessy Li, Matthew Lease

TL;DR

Wrapper Boxes introduce a general pipeline that wraps neural encoders with transparent, example-based classifiers to deliver predictions that remain largely on par with neural baselines while enabling faithful attribution to training data. By training classic models on neural representations, the approach provides inherently interpretable explanations without retraining the underlying neural module, and supports data-centric AI practices and algorithmic recourse. Empirical results across two NLP tasks and multiple architectures (including zero-shot LLM representations) demonstrate competitive predictive performance and strong data-attribution capabilities, with the $k$NN, DT, and $L$-Means wrappers offering different tradeoffs in faithfulness, simplicity, and subset size $S_t$. The work discusses practical limitations, such as storage costs and computational considerations, and emphasizes potential applications in model auditing and decision contestation, positioning wrapper boxes as a versatile, data-centric framework for interpretable NLP.

Abstract

Can we preserve the accuracy of neural models while also providing faithful explanations of model decisions to training data? We propose a "wrapper box" pipeline: training a neural model as usual and then using its learned feature representation in classic, interpretable models to perform prediction. Across seven language models of varying sizes, including four large language models (LLMs), two datasets at different scales, three classic models, and four evaluation metrics, we first show that the predictive performance of wrapper classic models is largely comparable to the original neural models. Because classic models are transparent, each model decision is determined by a known set of training examples that can be directly shown to users. Our pipeline thus preserves the predictive performance of neural language models while faithfully attributing classic model decisions to training data. Among other use cases, such attribution enables model decisions to be contested based on responsible training instances. Compared to prior work, our approach achieves higher coverage and correctness in identifying which training data to remove to change a model decision. To reproduce findings, our source code is online at: https://github.com/SamSoup/WrapperBox.
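To make the pipeline concrete, here is a minimal sketch of a $k$NN wrapper over fixed feature vectors. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The hand-written 2-D vectors stand in for a neural model's penultimate-layer representations. The function names `knn_wrapper_predict` and `removal_set_to_flip`, and the greedy removal strategy, are hypothetical choices made for this example.

```python
import math
from collections import Counter

# Toy stand-ins for penultimate-layer features (assumption: a real pipeline
# would extract these from a trained neural encoder).
TRAIN_FEATS = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
TRAIN_LABELS = ["benign", "benign", "benign", "toxic", "toxic", "toxic"]


def knn_wrapper_predict(train_feats, train_labels, query, k=3):
    """Return (predicted_label, indices of the k nearest training examples).

    The returned indices are the attribution: the prediction is a majority
    vote over exactly these training examples and nothing else.
    """
    ranked = sorted(
        (math.dist(feat, query), i) for i, feat in enumerate(train_feats)
    )
    neighbors = [i for _, i in ranked[:k]]
    votes = Counter(train_labels[i] for i in neighbors)
    return votes.most_common(1)[0][0], neighbors


def removal_set_to_flip(train_feats, train_labels, query, k=3):
    """Greedily find training indices whose removal changes the prediction.

    Hypothetical heuristic for illustration only: repeatedly drop the nearest
    neighbor that votes for the original label. Returns None if the
    prediction never changes while at least k examples remain.
    """
    original, _ = knn_wrapper_predict(train_feats, train_labels, query, k)
    keep = list(range(len(train_feats)))
    removed = []
    while len(keep) >= k:
        feats = [train_feats[i] for i in keep]
        labels = [train_labels[i] for i in keep]
        label, nbrs = knn_wrapper_predict(feats, labels, query, k)
        if label != original:
            return removed
        # nbrs index into the filtered lists; map back to original indices.
        target = next(keep[j] for j in nbrs if labels[j] == original)
        keep.remove(target)
        removed.append(target)
    return None


QUERY = (0.4, 0.3)
label, support = knn_wrapper_predict(TRAIN_FEATS, TRAIN_LABELS, QUERY)
flip_set = removal_set_to_flip(TRAIN_FEATS, TRAIN_LABELS, QUERY)
```

Here `support` lists the training examples that determined the decision, and `flip_set` is one candidate set of examples a user could contest. The subset-selection procedures evaluated in the paper may differ from this greedy heuristic.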

Paper Structure (50 sections, 1 equation, 2 figures, 14 tables, 2 algorithms)

Figures (2)

  • Figure 1: Three wrapper boxes illustrated for toxic language detection (Hartvigsen et al., 2022). Red and blue dots denote harmful vs. benign speech. Smaller dots represent examples, while larger dots represent clusters (e.g., DT leaf nodes). A neural model's penultimate layer provides the feature representation for the white wrapper boxes. Our results show that classic models can achieve comparable performance to the underlying neural models while also providing intuitive, example-based explanations (described in the section on wrapper boxes).
  • Figure 2: Visualization of resultant clusters from L-Means after PCA with two components.