FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models

Nils Neukirch, Johanna Vielhaben, Nils Strodthoff

TL;DR

FeatInv introduces a probabilistic, spatially conditioned inversion from feature space to input space using a ControlNet-style diffusion model, enabling faithful reconstruction across CNNs and ViTs by conditioning on spatial feature maps $c_f$. The method preserves fine-grained spatial structure by using unpooled feature maps, achieving high cosine similarity and Top5 accuracy with competitive FID scores across backbones, and demonstrating robustness to OOD, adversarial, and corrupted data. Two applications, FeatInv-Viz for visualizing concept steering and an analysis of the composite nature of feature space, showcase the interpretability and diagnostic potential of the approach. The approach is scalable and versatile, but it is restricted to natural images and requires dedicated training per backbone and layer, which suggests directions for broader deployment and extension.

Abstract

Internal representations are crucial for understanding deep neural networks, such as their properties and reasoning patterns, but remain difficult to interpret. While mapping from feature space to input space aids in interpreting the former, existing approaches often rely on crude approximations. We propose using a conditional diffusion model - a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps - to learn such a mapping in a probabilistic manner. We demonstrate the feasibility of this approach across various pretrained image classifiers from CNNs to ViTs, showing excellent reconstruction capabilities. Through qualitative comparisons and robustness analysis, we validate our method and showcase possible applications, such as the visualization of concept steering in input space or investigations of the composite nature of the feature space. This approach has broad potential for improving feature space understanding in computer vision models.

Paper Structure

This paper contains 20 sections, 27 figures, 4 tables, 1 algorithm.

Figures (27)

  • Figure 1: FeatInv learns a probabilistic mapping from feature space to input space and thereby provides a visualization of how a sample is perceived by the respective model. The goal is to identify input samples within the set of natural images whose feature representations align most closely with the original feature representation of a given model. In this figure, we visualize reconstructed samples obtained by conditioning on the feature maps of the penultimate layer from ResNet50, ConvNeXt and SwinV2 models.
  • Figure 2: Schematic overview of the FeatInv approach. Left: Given a spatially resolved feature map $c_f$ of some pretrained model, we aim to infer an input $x'$ within the set of natural images whose feature representation aligns as closely as possible with $c_f$, i.e., to learn a probabilistic mapping from feature space to input space. Previous work considers spatially pooled feature maps, whereas this work conditions on spatially resolved feature maps. Middle: We leverage a pretrained diffusion model, which is conditioned on $c_f$ by means of a ControlNet architecture that parametrizes an additive modification on top of the frozen diffusion model. Right top: The ControlNet adds trainable copies of blocks in the Stable Diffusion model; these copies are conditioned on the conditional input, and their output is added to the output of the original module, which is kept frozen. Right bottom: The feature map $c_f$ is processed through bilinear upsampling and a shallow convolutional encoder to serve as conditional input for the ControlNet.
  • Figure 3: Qualitative comparison of reconstructed samples for the ResNet50, ConvNeXt, SwinV2 and ConvNeXt pooled models. The cosine similarity between the original feature map and that of the reconstruction is noted at the bottom edge of the images. The qualitative comparison confirms the insights from the quantitative analysis in Tab. \ref{tab:model_comparison}. The two modern vision backbones, ConvNeXt and SwinV2, show reconstructions that resemble the original very closely, not only in terms of semantic content and spatial alignment but also in terms of color schemes and fine-grained details. Semantic content and composition also mostly match in the case of ResNet50, but not even the semantic content seems to be captured when using pooled representations (ConvNeXt pooled as an example).
  • Figure 4: Comparison of the cosine similarity, FID score and top-5 (top-1) match for the models trained on the last feature maps of the first, second, and third stage of ConvNeXt. The cosine similarity is calculated by averaging over all superpixels. The models trained to reconstruct earlier feature maps perform better in terms of cosine similarity, FID score and match rates than the model trained on the very last feature map.
  • Figure 5: Reconstruction of OOD samples. Comparison of randomly selected original and generated samples from the BAID dataset (Yi_2023_CVPR), which differs in style from the samples in ImageNet. The reconstructions were produced by the ControlNet trained on ConvNeXt features, which received the ConvNeXt features of the shown samples as input. The cosine similarity between the original feature map and that of the reconstruction is noted at the bottom edge of the images.
  • ...and 22 more figures
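
The captions above report a cosine similarity between feature maps and contrast spatially resolved with pooled conditioning. The following is a minimal sketch of how such a score could be computed, not the authors' implementation. It assumes that "superpixels" refers to the spatial positions of a feature map and that a feature map is stored as nested lists of shape H x W x C. All function names and the toy values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_spatial_cosine(f, g):
    """Average the per-position cosine similarity of two H x W x C maps.

    Assumption: one "superpixel" corresponds to one spatial position.
    """
    sims = [cosine(f[i][j], g[i][j])
            for i in range(len(f))
            for j in range(len(f[0]))]
    return sum(sims) / len(sims)

def global_avg_pool(f):
    """Collapse an H x W x C map to one C-dimensional vector."""
    cells = [vec for row in f for vec in row]
    n = len(cells)
    return [sum(vec[c] for vec in cells) / n for c in range(len(cells[0]))]

# Toy 2 x 2 feature map with 2 channels (hypothetical values).
original = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 1.0], [1.0, 0.0]]]

# Same channel vectors, but the two positions in the top row are swapped.
shuffled = [[[0.0, 1.0], [1.0, 0.0]],
            [[1.0, 1.0], [1.0, 0.0]]]

pooled_sim = cosine(global_avg_pool(original), global_avg_pool(shuffled))
spatial_sim = mean_spatial_cosine(original, shuffled)

print(f"pooled cosine similarity:  {pooled_sim:.3f}")
print(f"spatial cosine similarity: {spatial_sim:.3f}")
```

In this toy case the pooled vectors of the two maps are identical, so the pooled similarity cannot distinguish them, while the spatially averaged score drops because two of the four positions no longer match. This illustrates why a spatially resolved comparison is stricter about layout than a pooled one.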