Observation-specific explanations through scattered data approximation

Valentina Ghidini, Michael Multerer, Jacopo Quizi, Rohan Sen

TL;DR

This work reframes explainability by quantifying the influence of individual data points on a black-box predictor through observation-specific explanations. A surrogate model in a reproducing kernel Hilbert space is built via scattered data approximation, with orthogonal matching pursuit used to identify a small, informative subset of observations, from which normalized explanations $\gamma_i$ are derived. The surrogate comes with a provable reconstruction bound: $f^*$ approximates $f$ at the sample points with $|f(x_i)-f^*(x_i)| \leq \varepsilon \|f\|_{\mathcal{H}}$, which enables per-point diagnostics. Empirical evaluations on two synthetic examples (a quadratic function and the Ackley function) and a real possum dataset demonstrate high fidelity and show that influential observations tend to lie in boundary or sparsely populated regions. This offers a data-centric lens on model behavior, with potential insights into data representativeness and model fit.

Abstract

This work introduces observation-specific explanations, which assign each data point a score proportional to its importance in shaping the prediction process. Such explanations identify the most influential observations for the black-box model of interest. The proposed method estimates these explanations by constructing a surrogate model through scattered data approximation, using the orthogonal matching pursuit algorithm. The approach is validated on both simulated and real-world datasets.
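
The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, its length scale, the stopping rule, and in particular the normalization of the explanations (absolute surrogate coefficients rescaled to sum to one) are assumptions made here for illustration, and the names `gaussian_kernel`, `kernel_omp`, and `explanations` are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale=1.0):
    # Pairwise Gaussian kernel matrix between rows of A and rows of B.
    # The kernel choice and length scale are assumptions, not from the paper.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def kernel_omp(X, y, n_select, lengthscale=1.0, tol=1e-10):
    # Greedy orthogonal matching pursuit over kernel columns:
    # each column K[:, j] is the kernel translate centered at observation x_j.
    K = gaussian_kernel(X, X, lengthscale)
    col_norms = np.linalg.norm(K, axis=0)
    available = np.ones(len(X), dtype=bool)
    selected = []
    coef = np.zeros(0)
    residual = y.astype(float).copy()
    for _ in range(n_select):
        # Pick the unused column most correlated with the current residual.
        scores = np.abs(K.T @ residual) / col_norms
        scores[~available] = -np.inf
        j = int(np.argmax(scores))
        selected.append(j)
        available[j] = False
        # Refit all selected coefficients by least squares (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(K[:, selected], y, rcond=None)
        residual = y - K[:, selected] @ coef
        if np.linalg.norm(residual) <= tol * np.linalg.norm(y):
            break
    return np.array(selected), coef

def explanations(coef):
    # Assumed normalization: absolute coefficients rescaled to sum to one.
    w = np.abs(coef)
    return w / w.sum()

# Toy data: y stands in for black-box predictions f(x_i) on a quadratic.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X ** 2).sum(axis=1)

idx, coef = kernel_omp(X, y, n_select=20)
gamma = explanations(coef)                       # scores for the selected points
surrogate = gaussian_kernel(X, X[idx]) @ coef    # f*(x_i) at every sample
rel_err = np.abs(y - surrogate) / np.maximum(np.abs(y), 1e-12)
```

Here `idx` holds the indices of the selected observations, `gamma` their normalized scores, and `rel_err` a per-point relative reconstruction error of the kind shown in the left panels of the figures.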

Paper Structure (9 sections, 11 equations, 3 figures)

Figures (3)

  • Figure 1: First simulated scenario: the data generating process is a quadratic function. In the left plot, the colors correspond to the observation-specific relative error in the surrogate model reconstruction: darker shades represent higher errors. In the right plot, selected data points are colored and sized according to the magnitude of their explanations: darker and larger points indicate higher values.
  • Figure 2: Second simulated scenario: the data generating process is the Ackley function. In the left plot, the colors correspond to the observation-specific relative error in the surrogate model reconstruction: darker shades represent higher errors. In the right plot, selected data points are colored and sized according to the magnitude of their explanations: darker and larger points indicate higher values.
  • Figure 3: Possum dataset: scatter plots of the height of the animal versus pairs of correlated, standardized anatomical features. In the left plot, the colors correspond to the observation-specific relative error in the surrogate model reconstruction: darker shades represent higher errors. In the right plot, selected data points are colored and sized according to the magnitude of their explanations: darker and larger points indicate higher values.

Theorems & Definitions (1)

  • Definition