Discriminating image representations with principal distortions

Jenelle Feather, David Lipshutz, Sarah E. Harvey, Alex H. Williams, Eero P. Simoncelli

TL;DR

The paper addresses the problem of distinguishing image representations when global geometry is similar but local geometry differs by proposing a Fisher information–based framework to quantify local sensitivities around a base image. It introduces a metric on local geometry, $m_{\mathbf{u},\mathbf{v}}(\mathbf{I}_A,\mathbf{I}_B)$, and defines 'principal distortions' that maximize cross-model variance across $N$ representations, extending beyond pairwise comparisons. The approach is validated on hand-crafted early-visual models and multiple deep neural networks, revealing architecture- and training-dependent differences in local sensitivity and showing efficient model differentiation with a small set of distortions. These principal distortions offer a tool for probing how model local geometries align with human perception and could inform interpretability, psychophysics experiments, and the analysis of robustness and texture-shape biases.

Abstract

Image representations (artificial or biological) are often compared in terms of their global geometric structure; however, representations with similar global structure can have strikingly different local geometries. Here, we propose a framework for comparing a set of image representations in terms of their local geometries. We quantify the local geometry of a representation using the Fisher information matrix, a standard statistical tool for characterizing the sensitivity to local stimulus distortions, and use this as a substrate for a metric on the local geometry in the vicinity of a base image. This metric may then be used to optimally differentiate a set of models, by finding a pair of "principal distortions" that maximize the variance of the models under this metric. As an example, we use this framework to compare a set of simple models of the early visual system, identifying a novel set of image distortions that allow immediate comparison of the models by visual inspection. In a second example, we apply our method to a set of deep neural network models and reveal differences in the local geometry that arise due to architecture and training types. These examples demonstrate how our framework can be used to probe for informative differences in local sensitivities between complex models, and suggest how it could be used to compare model representations with human perception.
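The quantities named in the abstract can be sketched numerically. The block below is a minimal illustration, not the authors' implementation: it assumes a deterministic model with additive isotropic Gaussian response noise (so the FIM is $J^\top J/\sigma^2$, with $J$ the model Jacobian at the base image), takes a model's sensitivity to a distortion to be the FIM-induced norm, and scores a pair of distortions by the variance across models of the log sensitivity ratio. The three toy models, the finite-difference Jacobian, and all function names are hypothetical stand-ins chosen for the example.

```python
import numpy as np

def jacobian_fd(f, s, h=1e-5):
    """Central finite-difference Jacobian of f at s (rows: outputs, columns: inputs)."""
    s = np.asarray(s, dtype=float)
    cols = []
    for i in range(s.size):
        step = np.zeros_like(s)
        step[i] = h
        cols.append((f(s + step) - f(s - step)) / (2 * h))
    return np.stack(cols, axis=1)

def fisher_information(f, s, noise_std=1.0):
    """FIM of r = f(s) + N(0, noise_std^2 I), which is J^T J / noise_std^2."""
    J = jacobian_fd(f, s)
    return J.T @ J / noise_std**2

def sensitivity(fim, v):
    """FIM-induced norm of a distortion v (assumed definition of sensitivity)."""
    return float(np.sqrt(v @ fim @ v))

def log_ratio(fim, e1, e2):
    """Log ratio of one model's sensitivities to distortions e1 and e2."""
    return float(np.log(sensitivity(fim, e1) / sensitivity(fim, e2)))

def cross_model_variance(fims, e1, e2):
    """Variance across models of the log sensitivity ratio for the pair (e1, e2)."""
    return float(np.var([log_ratio(F, e1, e2) for F in fims]))

# Hypothetical toy "models" acting on a 2-pixel stimulus.
W = np.array([[2.0, 0.0], [0.0, 0.5]])
models = [
    lambda s: W @ s,                                      # linear filter
    lambda s: np.tanh(W @ s),                             # filter + pointwise nonlinearity
    lambda s: (W @ s) / (1.0 + np.linalg.norm(W @ s)),    # filter + crude gain control
]

s0 = np.array([0.3, -0.2])                   # base stimulus
fims = [fisher_information(f, s0) for f in models]

e1 = np.array([1.0, 0.0])                    # candidate distortion pair
e2 = np.array([0.0, 1.0])
print("log ratios:", [round(log_ratio(F, e1, e2), 3) for F in fims])
print("variance across models:", cross_model_variance(fims, e1, e2))
```

Under these assumptions, a pair of distortions is more useful for telling models apart when this variance is larger; the principal distortions are the pair that maximizes it.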

Paper Structure

This paper contains 23 sections, 21 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: Comparing the local geometry of image representations. A) Each model is assumed to map stimuli to stochastic responses in a representation space---deterministic models can be investigated by assuming additive Gaussian response noise. For example, Model A maps the stimulus ${\mathbf{s}}$ (solid black circle $\bullet$) to a conditional density $p({\mathbf{r}}_A|{\mathbf{s}})$ in Model A's representation space (solid blue circle $\bullet$ and surrounding transparent blue ellipse). B) Model sensitivity at the base image ${\mathbf{s}}$ to local distortions can be mapped back to the stimulus domain via the model's positive semidefinite FIM. In the top panel, "Eigen-distortions" (Berardino et al., 2017), the blue ellipse represents the unit level set $\{{\bm{v}}:d_A({\bm{v}})=1\}$ of the norm induced by Model A's FIM ${\bm{I}}_A$, which is the set of distortions of the base stimulus ${\mathbf{s}}$ that appear equally distorted according to Model A's representation. The eigenvectors of the FIM (${\boldsymbol{\epsilon}}_1$, ${\boldsymbol{\epsilon}}_2$) can equivalently be interpreted as the distortions that maximize the magnitude of the log ratio of the model's sensitivities (solid blue circle $\bullet$ on the number line). In the middle panel, "Generalized Eigen-distortions" (Zhou et al., 2023), the blue ellipse is copied from the top panel and the orange ellipse is the level set of Model B's FIM ${\bm{I}}_B$. The generalized eigenvectors of ${\bm{I}}_A$ and ${\bm{I}}_B$ (${\boldsymbol{\epsilon}}_1$, ${\boldsymbol{\epsilon}}_2$) can equivalently be interpreted as the vectors that maximize the difference between the log ratios of the models' sensitivities. In the bottom panel, "Principal distortions (this paper)", the blue and orange ellipses are as in the above panels and the green ellipse represents the level set of Model C's FIM ${\bm{I}}_C$. Here, we see two stimulus distortions (${\boldsymbol{\epsilon}}_1$, ${\boldsymbol{\epsilon}}_2$) that maximize the variance of the log ratios of model sensitivities.
  • Figure 2: Principal distortions of four early visual models. A) Log sensitivity ratios of the two principal distortions and two random distortions for each model (Berardino et al., 2017). Models are nested (LN is the most basic, LGN is the full model). Principal distortions (filled circles) give rise to diverse log ratios, while random distortions (hollow circles) do not. B) Natural image $\mathbf{s}$ and corresponding optimized principal distortions $\{{\boldsymbol{\epsilon}}_1, {\boldsymbol{\epsilon}}_2\}$. C) Natural image corrupted by principal distortions, with each pair scaled so as to be equally detectable by one model (as indicated above). Models are ordered by the log ratio of their sensitivities (panel A). If a model's thresholds are proportional to human thresholds, the corresponding pair of scaled distortions should be equally visible in the top and bottom images. Note: Images are best viewed at high resolution.
  • Figure 3: AlexNet versus ResNet50. A) Example base images. B) Optimized principal distortions (scaled by a factor of 20 for visibility, and using the convention $\|{\boldsymbol{\epsilon}}\|=1$ here and in other figures). C) Base image plus principal distortions. D) Log sensitivity ratios of principal distortions when comparing image representations at multiple layers of AlexNet and ResNet50. Assignment of ${\boldsymbol{\epsilon}}_1$ and ${\boldsymbol{\epsilon}}_2$ was chosen so that the final tested layer of AlexNet has a positive log ratio. E) Log sensitivity ratios averaged across 100 base images (error bars show standard deviation). The principal distortions organize the networks by architecture---the log sensitivity ratios of AlexNet and ResNet50 are separated and early layers have smaller log ratios than late layers. AlexNet is more sensitive to distortion ${\boldsymbol{\epsilon}}_1$, which is concentrated on higher contrast or textured parts of the image (often the foreground object). ResNet50 is more sensitive to distortion ${\boldsymbol{\epsilon}}_2$, which concentrates power on relatively smooth parts of the image, such as regions of constant intensity/color.
  • Figure SI.1: Illustration of method for comparing early visual models to human perception. A) An example base image (${\mathbf{s}}$) and the principal distortions (${\boldsymbol{\epsilon}}_1$, ${\boldsymbol{\epsilon}}_2$) generated for the four early visual models (as in Fig. 2). B) Base image corrupted by the two principal distortions (top and bottom row), scaled by amplitudes that sum to 50 and correspond to different log sensitivity ratios (parenthesized values below the pairs, also indicated with grey arrows in panel C). A perceptual experiment can be designed to test which pair of corrupted images (indexed A through I) appears equally distorted to a human observer. This could be done directly, by showing various pairs of distorted images (i.e., ${\mathbf{s}}+k_1{\boldsymbol{\epsilon}}_1$ and ${\mathbf{s}}+k_2{\boldsymbol{\epsilon}}_2$ with varying amplitudes $k_1,k_2$) and asking an observer which of the two appears more distorted with respect to the original ($\mathbf{s}$). The pair of images for which an observer cannot decide (i.e., gives random answers) corresponds to the observer's log sensitivity ratio. Alternatively, this could be assessed by measuring the observer's detection thresholds for each distortion independently, and then taking their log ratio. This estimated human log sensitivity ratio can then be compared to the models' log sensitivity ratios (colored dots, panel C), to assess which model is best aligned with human behavior. Images are best viewed at high resolution.
  • Figure SI.2: Comparison of ImageNet-1k trained EfficientNet and Vision Transformer. A) Example principal distortions and associated log sensitivity ratio plots when comparing layers of EfficientNet-B0 and ViT (Base-Patch16-224). B) When measured over 72 images from the ImageNet validation set (Xiao et al., 2021), the obtained principal distortions reliably separate the models by architecture---the layers of EfficientNet are more sensitive to distortion ${\boldsymbol{\epsilon}}_1$, while the layers of ViT are more sensitive to ${\boldsymbol{\epsilon}}_2$. The principal distortions are qualitatively different from those obtained when comparing AlexNet and ResNet50---for instance, the principal distortions ${\boldsymbol{\epsilon}}_2$ (i.e., the distortions to which the ViT layers are more sensitive) have notable grid artifacts corresponding to the size of the input patches to the ViT model. C) The distortions modify or preserve the label computed by each model, consistent with their predicted sensitivity (see Supp. Fig. \ref{supfig:classification} for a detailed explanation).
  • ...and 7 more figures
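
The three panels of Figure 1B and the amplitude scaling described for Figures 2C and SI.1 can be reproduced on a two-dimensional toy problem. The sketch below is only illustrative: the three FIMs are made-up values, the brute-force angle grid is a stand-in for whatever optimizer the paper uses, and the amplitude rule assumes that "equally detectable by one model" means $k_1 d(\boldsymbol{\epsilon}_1) = k_2 d(\boldsymbol{\epsilon}_2)$ with $d$ the FIM-induced norm. All function names are hypothetical.

```python
import numpy as np

def eigen_distortions(fim):
    """Unit distortions to which a single model is most and least sensitive."""
    vals, vecs = np.linalg.eigh(fim)          # eigenvalues in ascending order
    return vecs[:, -1], vecs[:, 0]

def generalized_eigen_distortions(fim_a, fim_b):
    """Directions that maximize and minimize (v^T fim_a v) / (v^T fim_b v).
    Assumes fim_b is positive definite so that it can be inverted."""
    vals, vecs = np.linalg.eig(np.linalg.solve(fim_b, fim_a))
    vals, vecs = vals.real, vecs.real
    vecs = vecs / np.linalg.norm(vecs, axis=0)
    order = np.argsort(vals)
    return vecs[:, order[-1]], vecs[:, order[0]]

def principal_distortions_2d(fims, n_angles=180):
    """Grid search over pairs of unit vectors in 2D for the pair that maximizes
    the variance across models of the log sensitivity ratio."""
    angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # Log sensitivity of each model along each direction: (n_models, n_angles).
    log_sens = np.stack(
        [0.5 * np.log(np.einsum("id,de,ie->i", dirs, F, dirs)) for F in fims]
    )
    # Log ratios for every ordered pair of directions: (n_models, n_angles, n_angles).
    ratios = log_sens[:, :, None] - log_sens[:, None, :]
    var = ratios.var(axis=0)
    i, j = np.unravel_index(np.argmax(var), var.shape)
    return dirs[i], dirs[j], float(var[i, j])

def equal_detectability_amplitudes(fim, e1, e2, total=50.0):
    """Amplitudes k1, k2 with k1 + k2 = total and k1 * d(e1) == k2 * d(e2)."""
    d1 = np.sqrt(e1 @ fim @ e1)
    d2 = np.sqrt(e2 @ fim @ e2)
    return total * d2 / (d1 + d2), total * d1 / (d1 + d2)

# Made-up FIMs standing in for Models A, B and C.
I_A = np.array([[4.0, 0.0], [0.0, 1.0]])
I_B = np.array([[1.0, 0.0], [0.0, 4.0]])
I_C = np.array([[2.5, 1.5], [1.5, 2.5]])

top_a, bottom_a = eigen_distortions(I_A)
gen_top, gen_bottom = generalized_eigen_distortions(I_A, I_B)
e1, e2, var_abc = principal_distortions_2d([I_A, I_B, I_C])
k1, k2 = equal_detectability_amplitudes(I_A, e1, e2)

print("eigen-distortions of A:", top_a, bottom_a)
print("generalized eigen-distortions of (A, B):", gen_top, gen_bottom)
print("principal distortions of (A, B, C):", e1, e2, "variance:", var_abc)
print("amplitudes equally detectable by A:", k1, k2)
```

With only two models, the variance of the log ratios is a quarter of their squared difference, so the grid search recovers the same directions as the generalized eigenvectors up to grid resolution; the difference between the two methods appears only with three or more models.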