Training Feature Attribution for Vision Models

Aziz Bacha, Thomas George

TL;DR

The paper addresses the explainability gap by introducing Training Feature Attribution (TFA), which links test-time predictions to specific regions of training images. It combines a training data attribution score $S(z_i^{train}, z_j^{test})$ with a gradient-based feature attribution through Grad-Cos $S_{GC}$ to produce pixelwise, test-specific saliency maps on training examples. Analytic and empirical demonstrations on datasets such as CIFAR-10 and Pascal VOC 2012 show that TFA reveals not only which training samples are influential but also which regions within those samples matter, exposing harmful examples and patch-based shortcuts that FA or TDA alone miss. The method provides a practical tool for debugging and auditing vision models, with potential extension to higher-level, human-interpretable concepts in future work.
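The combination described above can be illustrated with a toy sketch. It assumes, as one plausible reading of the summary, that Grad-Cos is the cosine similarity between the parameter gradients of the training loss and the test loss, and that the pixelwise map is the derivative of that score with respect to the training image's pixels. The linear softmax model, the helper names (`param_grad`, `grad_cos`, `tfa_saliency`), and the use of central finite differences in place of automatic differentiation are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def param_grad(W, x, y):
    """Cross-entropy gradient w.r.t. W for a toy linear model (logits = W @ x)."""
    p = softmax(W @ x)
    p[y] -= 1.0                  # p - onehot(y)
    return np.outer(p, x)

def grad_cos(W, x_train, y_train, x_test, y_test):
    """Assumed Grad-Cos score: cosine similarity of the two parameter gradients."""
    g_tr = param_grad(W, x_train, y_train).ravel()
    g_te = param_grad(W, x_test, y_test).ravel()
    denom = np.linalg.norm(g_tr) * np.linalg.norm(g_te) + 1e-12
    return float(g_tr @ g_te / denom)

def tfa_saliency(W, x_train, y_train, x_test, y_test, eps=1e-5):
    """Derivative of the score w.r.t. each training pixel (central differences)."""
    sal = np.zeros_like(x_train)
    for k in range(x_train.size):
        step = np.zeros_like(x_train)
        step[k] = eps
        s_plus = grad_cos(W, x_train + step, y_train, x_test, y_test)
        s_minus = grad_cos(W, x_train - step, y_train, x_test, y_test)
        sal[k] = (s_plus - s_minus) / (2 * eps)
    return sal

# Hypothetical data: 3 classes, flattened 4x4 "images".
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))
x_train = rng.normal(size=16)
x_test_a = rng.normal(size=16)
x_test_b = rng.normal(size=16)

score = grad_cos(W, x_train, 1, x_test_a, 1)
map_a = tfa_saliency(W, x_train, 1, x_test_a, 1).reshape(4, 4)
map_b = tfa_saliency(W, x_train, 1, x_test_b, 2).reshape(4, 4)
```

Here `map_a` and `map_b` are two saliency maps over the same training image, each computed against a different test example, which is the sense in which the maps are test-specific. For a deep network, the finite-difference loop would normally be replaced by a second-order automatic-differentiation pass.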

Abstract

Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores *training feature attribution*, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.

Paper Structure

This paper contains 35 sections, 31 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: A prediction on a test example depends both on the features of that example and on features learned from the training examples through the trained parameters.
  • Figure 2: TFA saliencies (Equation \ref{eq:saliency}) for the top-3 most influential training images per test image. Each panel (left to right): test image, influential training images, and their influence maps (smoothed using Equation \ref{eq:smoothed}).
  • Figure 3: Two examples showing test-image dependence of pixelwise influence maps. In each example, the same training image yields different saliency patterns depending on the test image label. Left pair: dog vs. person; right pair: cat vs. person.
  • Figure 4: For a test image of a sheep misclassified as a dog, the two most influential training images are (1) a dalmatian, and (2) an image containing both a dog and a sheep. The influence maps show the model relies on the dog regions when predicting the test image.
  • Figure 5: Left to right: (1) Test image of sheep, misclassified as cow; (2) Most influential training image (sheep); (3) Pixelwise influence map reveals the model heavily relies on the red patch for its prediction; (4) Grad-CAM map for the test image.
  • ...and 6 more figures