Understanding Cross-Model Perceptual Invariances Through Ensemble Metamers

Lukas Boehm, Jonas Leo Mueller, Christoffer Loeffler, Leo Schwinn, Bjoern Eskofier, Dario Zanca

TL;DR

This work investigates how architectural differences shape perceptual invariances in artificial vision by generating metamers through ensemble-based optimization across CNNs and vision transformers. It introduces a multi-model metamer generation framework that optimizes activations across an ensemble using projected gradient descent and an inversion loss to produce metamers that are both natural-looking and cross-model recognizable. Evaluations across diverse model sets and image-quality metrics reveal that CNNs yield more recognizable and human-like metamers, while transformers produce metamers that look natural but transfer less across models, highlighting the impact of architectural biases on representational invariances. The findings underscore the value of ensemble approaches for improving cross-model consistency and offer insights for aligning machine-perceived visuals with human perception, with implications for interpretability and robustness across architectures.

Abstract

Understanding the perceptual invariances of artificial neural networks is essential for improving explainability and aligning models with human vision. Metamers, stimuli that are physically distinct yet produce identical neural activations, serve as a valuable tool for investigating these invariances. We introduce a novel approach to metamer generation by leveraging ensembles of artificial neural networks, capturing shared representational subspaces across diverse architectures, including convolutional neural networks and vision transformers. To characterize the properties of the generated metamers, we employ a suite of image-based metrics that assess factors such as semantic fidelity and naturalness. Our findings show that convolutional neural networks generate more recognizable and human-like metamers, while vision transformers produce realistic but less transferable metamers, highlighting the impact of architectural biases on representational invariances.
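
The ensemble idea can be illustrated with a deliberately small sketch. It is not the authors' implementation: the "models" are toy linear maps standing in for network layers, the loss is a plain squared distance between activations (the paper's inversion loss is not specified in this summary), and the projection step is assumed to be clipping to the pixel range [0, 1]. All names (`generate_metamer`, `MODEL_A`, `MODEL_B`, `switch_every`) are hypothetical.

```python
def matvec(weights, x):
    """Toy 'activations': a linear map standing in for a network layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def activation_loss(weights, x, target):
    """Squared distance between the activations of x and the target activations."""
    return sum((a - t) ** 2 for a, t in zip(matvec(weights, x), target))


def activation_grad(weights, x, target):
    """Gradient of the squared activation distance: 2 * W^T (W x - target)."""
    residual = [a - t for a, t in zip(matvec(weights, x), target)]
    return [2.0 * sum(row[j] * r for row, r in zip(weights, residual))
            for j in range(len(x))]


def generate_metamer(models, reference, start, steps=400, switch_every=10, lr=0.1):
    """Projected gradient descent that rotates the active model every few steps."""
    targets = [matvec(w, reference) for w in models]
    x = list(start)
    for step in range(steps):
        # Pick the active model; it changes every `switch_every` iterations.
        k = (step // switch_every) % len(models)
        grad = activation_grad(models[k], x, targets[k])
        # Gradient step, then projection onto the assumed valid range [0, 1].
        x = [min(1.0, max(0.0, xi - lr * gi)) for xi, gi in zip(x, grad)]
    return x


# Assumed toy models: A reads (x0 + x1, x2); B reads (x0, x1); neither reads x3.
MODEL_A = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
MODEL_B = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
REFERENCE = [0.2, 0.7, 0.5, 0.9]
START = [0.9, 0.1, 0.3, 0.1]

single = generate_metamer([MODEL_A], REFERENCE, START)
ensemble = generate_metamer([MODEL_A, MODEL_B], REFERENCE, START)

for name, x in (("single (A only)", single), ("ensemble (A + B)", ensemble)):
    loss_a = activation_loss(MODEL_A, x, matvec(MODEL_A, REFERENCE))
    loss_b = activation_loss(MODEL_B, x, matvec(MODEL_B, REFERENCE))
    print(f"{name}: loss under A = {loss_a:.2e}, loss under B = {loss_b:.2e}")
```

In this toy setting, the stimulus optimized against model A alone matches A's activations but not B's, whereas the ensemble stimulus matches both while still differing from the reference in the coordinate that neither model reads. This only shows the mechanism of shared invariances; it says nothing about how real CNNs or vision transformers behave.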

Paper Structure

This paper contains 20 sections, 1 equation, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: (Left) Overview of the metamer generation process. The activations of a model guide the projected gradient descent for a batch of images. The active model is switched every few iterations. (Middle) Three techniques are used to evaluate the final metamers. First, the activations produced by the reference and the metameric stimulus are extracted, converted into a distribution (over the entire dataset/batch), and compared using the Jensen-Shannon divergence. Second, recognizability is an accuracy metric that compares the classification output for the reference with that for the metamer. Third, image metrics are standalone functions that operate on the final image to rate its visual appearance. They typically focus on low noise content and natural image elements. (Right) Example metamers generated by a set of CNN models after 5000 steps (late stage).
  • Figure 2: Examples of single-model and ensemble metamers. Single models tend to produce metamers that quickly become unrecognizable when generated from intermediate or deeper layers. In contrast, ensembles generally yield more robust and recognizable metamers across layers, though this advantage is less pronounced for transformer-based ensembles. Each model or ensemble introduces distinct artifacts.
  • Figure 3: Recognizability curves. Each subplot shows recognizability (accuracy) against the metamer generation stage (Early, Middle, Late) for various model sets. Comparisons include standard and robust variants of CNNs and Transformers. CNN ensembles and robust CNN ensembles retain high recognizability for late-stage metamers.
  • Figure 4: Image metrics for ensemble metamers. The green arrow indicates whether a higher or lower value is better. Note that some metrics are normalized, giving them a maximum value of $1.0$.
  • Figure 5: Jensen-Shannon divergence. The Jensen-Shannon divergence is calculated between each possible pair of representational similarity distributions. Combinations are defined by the model set, the evaluation model, and the generation stage. High divergence values indicate that the distributions have distinct central tendencies. Cells with dashed outlines denote cases where the classification model (row) was included in the corresponding model set (column), rendering the comparison uninformative for our purposes. Such instances do not contribute meaningful insight into metamer transferability.
  • ...and 2 more figures
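
Two of the evaluation techniques named in the captions above, the Jensen-Shannon divergence between similarity distributions and recognizability as label agreement, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the summary does not say how similarity scores are turned into distributions, so a fixed-bin histogram over [0, 1] is assumed, and the function names are hypothetical.

```python
import math


def histogram(values, bins, lo=0.0, hi=1.0):
    """Normalized histogram: turns a non-empty batch of scores into a distribution."""
    counts = [0] * bins
    for v in values:
        # Clamp the bin index so that v == hi falls into the last bin.
        idx = min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins)))
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def recognizability(reference_labels, metamer_labels):
    """Fraction of metamers that receive the same predicted label as their reference."""
    matches = sum(r == m for r, m in zip(reference_labels, metamer_labels))
    return matches / len(reference_labels)


# Hypothetical similarity scores from two (model set, evaluation model) combinations.
scores_a = [0.91, 0.88, 0.95, 0.79]
scores_b = [0.35, 0.42, 0.51, 0.83]
p = histogram(scores_a, bins=4)
q = histogram(scores_b, bins=4)
print("JS divergence (bits):", js_divergence(p, q))
print("recognizability:", recognizability([3, 1, 4, 1], [3, 1, 0, 1]))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with no overlapping bins), which makes values comparable across combinations. The bin count is a free choice in this sketch and would affect the resulting numbers.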