Training Feature Attribution for Vision Models
Aziz Bacha, Thomas George
TL;DR
The paper addresses the explainability gap by introducing Training Feature Attribution (TFA), which links test-time predictions to specific regions of training images. It combines a training data attribution (TDA) score $S(z_i^{train}, z_j^{test})$ with a gradient-based feature attribution (FA) computed through the Grad-Cos score $S_{GC}$, producing pixelwise, test-specific saliency maps on training examples. Analytic and empirical demonstrations on datasets such as CIFAR-10 and Pascal VOC 2012 show that TFA reveals not only which training samples are influential but also which regions within those samples matter, exposing harmful examples and patch-based shortcuts that FA or TDA alone miss. The method offers a practical tool for debugging and auditing vision models, and future work could extend it to higher-level, human-interpretable concepts.
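To make the combination concrete, here is a minimal sketch, not the authors' implementation. It assumes that $S_{GC}$ is the cosine similarity between the parameter gradients of the loss at a training example and at a test example, and that the training feature attribution is the gradient of that score with respect to the training input. The toy linear softmax model, the finite-difference approximation, and all function names are illustrative assumptions, chosen so the example runs with the standard library only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_grad(W, x, y):
    """Flattened gradient of the cross-entropy loss w.r.t. W.

    Assumed toy model: logits = W @ x, so dL/dW = (p - onehot(y)) outer x.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    p = softmax(logits)
    return [(p[c] - (1.0 if c == y else 0.0)) * xi
            for c in range(len(W)) for xi in x]


def grad_cos(W, z_train, z_test):
    """Assumed Grad-Cos score: cosine similarity of the two loss gradients."""
    g_tr = loss_grad(W, *z_train)
    g_te = loss_grad(W, *z_test)
    dot = sum(a * b for a, b in zip(g_tr, g_te))
    norm = math.sqrt(sum(a * a for a in g_tr)) * math.sqrt(sum(b * b for b in g_te))
    return dot / norm if norm > 0 else 0.0


def training_feature_attribution(W, z_train, z_test, eps=1e-5):
    """Saliency over training inputs: d grad_cos / d x_train.

    Uses central finite differences as a stand-in for automatic
    differentiation (an assumption made to keep the sketch dependency-free).
    """
    x, y = z_train
    saliency = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += eps
        x_minus = list(x)
        x_minus[i] -= eps
        s_plus = grad_cos(W, (x_plus, y), z_test)
        s_minus = grad_cos(W, (x_minus, y), z_test)
        saliency.append((s_plus - s_minus) / (2 * eps))
    return saliency


# Hypothetical 2-class model over 3 "pixels", one training and one test example.
W = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]]
z_train = ([1.0, 2.0, 0.5], 0)
z_test = ([0.8, 1.5, -0.3], 0)

print("Grad-Cos score:", grad_cos(W, z_train, z_test))
print("Saliency on training input:", training_feature_attribution(W, z_train, z_test))
```

In a real vision model the input gradient would be obtained with automatic differentiation rather than finite differences. The sketch only illustrates that the resulting map depends on both the chosen training example and the chosen test example.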
Abstract
Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores *training feature attribution*, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.
