MindSet: Vision. A toolbox for testing DNNs on key psychological experiments

Valerio Biscione, Dong Yin, Gaurav Malhotra, Marin Dujmovic, Milton L. Montero, Guillermo Puebla, Federico Adolfi, Rachel F. Heaton, John E. Hummel, Benjamin D. Evans, Karim Habashy, Jeffrey S. Bowers

TL;DR

MindSet: Vision is a modular, open-source toolbox for testing DNNs against visual psychology experiments. It provides manipulable stimuli spanning low- and mid-level vision, visual illusions, and shape and object recognition. Three testing methods (Out-of-Distribution Classification, Similarity Judgment Analysis, and the Decoder Method) support a causal appraisal of DNN-human alignment rather than a ranking of models on aggregated benchmarks. The datasets and regeneration scripts, released under the MIT license, let researchers test specific hypotheses with configurable parameters. The work aims to bridge computational modeling and psychology, to accelerate the development of DNNs that emulate human visual processing, and to facilitate broader investigations into memory, language, and perception.

Abstract

Multiple benchmarks have been developed to assess the alignment between deep neural networks (DNNs) and human vision. In almost all cases, these benchmarks are observational in the sense that they are composed of behavioural and brain responses to naturalistic images that have not been manipulated to test hypotheses about how DNNs or humans perceive and identify objects. Here we introduce the toolbox MindSet: Vision, a collection of image datasets and related scripts designed to test DNNs on 30 psychological findings. In all experimental conditions, the stimuli are systematically manipulated to test specific hypotheses regarding human visual perception and object recognition. In addition to pre-generated image datasets, we provide code to regenerate them, with many configurable parameters that greatly extend the datasets' versatility across research contexts. We also provide code to facilitate testing DNNs on these datasets using three different methods (similarity judgments, out-of-distribution classification, and the decoder method), accessible at https://github.com/MindSetVision/mindset-vision. We test ResNet-152 with each of these methods as an example of how the toolbox can be used.

Paper Structure

This paper contains 44 sections and 14 figures.

Figures (14)

  • Figure 1: Comprehensive overview of the 'MindSet: Vision' datasets, arranged in three main categories. Each panel represents a distinct dataset, which is further divided into conditions. The images provide examples from these conditions, generated with default parameters.
  • Figure 2: Depiction of two of the three proposed methods of evaluating DNNs, each illustrated with a representative dataset. The first method, out-of-distribution classification, is not depicted here. The Similarity Judgment Analysis (top panel) involves feeding pairs of images to DNNs and comparing the elicited internal representations. We illustrate this method with the 'Texturized Unfamiliar' dataset, showing that the network produces human-like responses in earlier layers that diminish in later ones. The Decoder Method (bottom panel) involves training and testing a simple linear layer attached to different stages of a frozen network. In the given example, we assess the response to the Ebbinghaus illusion. Our findings indicate an absence of illusory perception. Both examples use an ImageNet pre-trained ResNet-152.
  • Figure 3: Samples of GPT-4 responses to different images from our silhouettes and texturized datasets, with each image presented in a new conversation. The model's accuracy drops significantly with texturized images.
  • Figure 4: Illustration of the Crowding and Uncrowding effects. a. Observers perform a vernier discrimination task. A standard approach consists of measuring the vernier offset at which observers discriminate correctly in 75% of the trials. With the vernier alone, the offset is quite small. b. When a square is added, performance drops drastically (that is, the threshold offset increases). This is the classic crowding effect. c. Adding more flankers increases performance again. This is referred to as uncrowding. d. The magnitude of crowding and uncrowding effects depends on both short-range and long-range spatial interactions between visual elements. Furthermore, the specific characteristics and spatial positioning of flanker stimuli play a crucial role in modulating these effects. For example, performance drops again for the depicted pattern.
  • Figure 5: Schematic of the generation procedure for producing a set of dotted stimuli. Starting with a pair of images in which the only discriminant feature is the location of a dot (Base Pair), an additional dot is added, yielding the Emergent Feature of proximity or orientation. The Emergent Feature of linearity is obtained by adding a dot to the orientation pair. Notice that the added dot is the same for both elements of the pair, so it does not on its own add any discriminative feature; instead, it generates additional features in relation to the surrounding dots.
  • ...and 9 more figures
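
The Similarity Judgment Analysis described in the Figure 2 caption (feeding a pair of images to a network and comparing the internal representations each one elicits) can be illustrated with a minimal sketch. This is not the toolbox's own code. The choice of cosine similarity as the comparison metric, the function names `cosine_similarity` and `layerwise_similarity`, the layer names, and the dictionary format for activations are all assumptions made for illustration, and the activation values are invented toy numbers rather than real network output.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layerwise_similarity(acts_a, acts_b):
    """Compare two images layer by layer.

    acts_a and acts_b are hypothetical dicts mapping a layer name to the
    flattened activation vector recorded at that layer for each image.
    """
    return {
        layer: cosine_similarity(acts_a[layer], acts_b[layer])
        for layer in acts_a
    }


# Toy activations, invented for illustration only (not real network output).
image_a = {"early": [1.0, 0.0, 1.0], "late": [1.0, 0.0]}
image_b = {"early": [2.0, 0.0, 2.0], "late": [0.0, 1.0]}

for layer, sim in layerwise_similarity(image_a, image_b).items():
    print(f"{layer}: {sim:.3f}")
```

In a real experiment the activation vectors would presumably be recorded from a pretrained network at several depths, and the per-layer similarities for different image pairs would then be compared against the pattern of human similarity judgments.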