Cluster Paths: Navigating Interpretability in Neural Networks

Nicholas M. Kroeger, Vincent Bindschaedler

Abstract

While modern deep neural networks achieve impressive performance in vision tasks, they remain opaque in their decision processes, risking unwarranted trust, undetected biases and unexpected failures. We propose cluster paths, a post-hoc interpretability method that clusters activations at selected layers and represents each input as its sequence of cluster IDs. To assess these cluster paths, we introduce four metrics: path complexity (cognitive load), weighted-path purity (class alignment), decision-alignment faithfulness (predictive fidelity), and path agreement (stability under perturbations). In a spurious-cue CIFAR-10 experiment, cluster paths identify color-based shortcuts and collapse when the cue is removed. On a five-class CelebA hair-color task, they achieve 90% faithfulness and maintain 96% agreement under Gaussian noise without sacrificing accuracy. Scaling to a Vision Transformer pretrained on ImageNet, we extend cluster paths to concept paths derived from prompting a large language model on minimal path divergences. Finally, we show that cluster paths can serve as an effective out-of-distribution (OOD) detector, reliably flagging anomalous samples before the model generates over-confident predictions. Cluster paths uncover visual concepts, such as color palettes, textures, or object contexts, at multiple network depths, demonstrating that cluster paths scale to large vision models while generating concise and human-readable explanations.
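
The core idea stated above (cluster the activations at selected layers, then describe each input by its sequence of cluster IDs) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the tiny k-means routine, the Euclidean nearest-centroid rule, the layer names `layer2` and `layer4`, the `-` separator, and the toy activation vectors are all assumptions made for the example; the abstract does not specify the clustering algorithm or distance.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance, an assumption)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means used as a stand-in for whatever clustering algorithm is preferred."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def fit_layer_clusters(train_acts, k_per_layer):
    """train_acts maps a layer name to a list of activation vectors (one per training sample)."""
    return {name: kmeans(acts, k_per_layer[name]) for name, acts in train_acts.items()}

def cluster_path(sample_acts, layer_centroids, layers):
    """Cluster path of one sample: the nearest cluster index at each selected layer."""
    return "-".join(str(nearest(sample_acts[name], layer_centroids[name])) for name in layers)

# Hypothetical activations recorded at two layers for four training samples.
layers = ["layer2", "layer4"]
train_acts = {
    "layer2": [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]],
    "layer4": [[1.0], [1.1], [9.0], [9.1]],
}
centroids = fit_layer_clusters(train_acts, {"layer2": 2, "layer4": 2})

# Two new samples that fall in different regions of both feature spaces.
path_a = cluster_path({"layer2": [0.05, 0.0], "layer4": [1.05]}, centroids, layers)
path_b = cluster_path({"layer2": [5.05, 5.0], "layer4": [9.05]}, centroids, layers)
print(path_a, path_b)
```

Because cluster indices are arbitrary labels, the two printed paths carry meaning only relative to each other and to the training samples that share them; in this toy setup the two samples end up on different paths at both layers.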

Paper Structure

This paper contains 34 sections, 26 figures, 9 tables, 2 algorithms.

Figures (26)

  • Figure 1: An illustration of the cluster path creation process. In this example, there are three clusters in layer two, and four clusters in layer four. Layer three's clusters are hidden for brevity. For illustration purposes, clusters are depicted as circles in an $\mathbb{R}^2$ space. (A) We forward propagate the training dataset and (B) cluster each layer's activations. (C) As a new sample is forward propagated through the network, we find the closest cluster (the red dot represents the sample in the layer's feature space). (D) Finally, the enumeration of the closest clusters in each layer defines a cluster path, represented by a string of cluster indices.
  • Figure 2: Example images from the normal (left) and corrupted (right) versions of the SpuriousCIFAR10 dataset. In the normal set, each cat image typically has a red patch while each dog image has a blue patch, reflecting the spurious color-to-class correlation.
  • Figure 3: Cluster path visualizations for three distinct groups. The top panel shows a cluster predominantly comprising samples with blue patches, the middle panel contains samples mainly with red patches, and the bottom panel exhibits a heterogeneous mix of patch colors. These visualizations suggest that the network's internal representations are strongly influenced by the spurious patch cue in the top and middle clusters, while the bottom cluster reflects a lack of consistent reliance on the spurious signal.
  • Figure 4: Heatmap showing the average nearest-neighbor agreement on a test set, where each row represents a layer (e.g., conv_out, fc1, fc2, fc_classifier) and each column a patch color. The value in each cell is the percentage of training neighbors that share the test sample's patch color. Despite randomized patches, red and blue remain near 100% agreement, revealing that the network's learned representations still prioritize these spurious patch cues.
  • Figure 5: Each row shows one test image (original column) followed by four attribution methods (Saliency, Grad × Input, Integrated Gradients, and Gradient SHAP). Row 1: red-patched cat; all methods highlight the red square and ignore the animal. Row 2: blue-patched dog; the heatmaps lock onto the blue square. Row 3: green-patched dog (uncorrelated color); attention is diffuse, with little focus on the patch. Row 4: cyan-patched cat (uncorrelated color); attributions scatter over the background, with the cyan patch drawing almost no attention.
  • ...and 21 more figures
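
The nearest-neighbor agreement described in the Figure 4 caption (for each test sample, the percentage of training neighbors that share its patch color, averaged per color within one layer) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `neighbor_agreement`, the choice of `k = 3` neighbors, the Euclidean distance, and the one-dimensional toy activations are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict

def neighbor_agreement(train_acts, train_colors, test_acts, test_colors, k=3):
    """For one layer, return {patch_color: average % of the k nearest training
    neighbors that share the test sample's patch color}."""
    per_color = defaultdict(list)
    for act, color in zip(test_acts, test_colors):
        # Rank training samples by distance to this test sample in the layer's feature space.
        order = sorted(range(len(train_acts)), key=lambda i: math.dist(act, train_acts[i]))
        neighbors = order[:k]
        share = sum(train_colors[i] == color for i in neighbors) / len(neighbors)
        per_color[color].append(100.0 * share)
    return {color: sum(vals) / len(vals) for color, vals in per_color.items()}

# Hypothetical activations at a single layer: red-patched samples near 0, blue-patched near 10.
train_acts = [[0.0], [0.2], [0.4], [10.0], [10.2], [10.4]]
train_colors = ["red", "red", "red", "blue", "blue", "blue"]
test_acts = [[0.1], [10.1]]
test_colors = ["red", "blue"]

row = neighbor_agreement(train_acts, train_colors, test_acts, test_colors, k=3)
print(row)  # one heatmap row: agreement per patch color for this layer
```

Repeating the call once per layer (for example, with activations from `conv_out`, `fc1`, `fc2`, and `fc_classifier`) would give one row of the heatmap per layer.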