What Makes a Face Look like a Hat: Decoupling Low-level and High-level Visual Properties with Image Triplets

Maytus Piriyajitakonkij, Sirawaj Itthipuripat, Ian Ballard, Ioannis Pappas

TL;DR

This work addresses how low-level visual properties affect high-level visual decisions by introducing a brain-model-guided method to decorrelate high- and low-level similarity in natural images. The authors construct image-triplet stimuli from the representations of CORnet-S (strong neural predictivity for high-level, IT-like responses) and VGG-16 (strong neural predictivity for low-level responses), scoring each triplet with a high-level and a low-level measure, $D_{\text{high}}$ and $D_{\text{low}}$. CORnet-S better explains human choices driven by high-level similarity, whereas VGG-16 better explains choices driven by low-level similarity, and Brain-Score neural predictivity is qualitatively consistent with this dissociation. The approach provides a principled way to study how distinct stages of the ventral stream contribute to behavior and offers a tool to guide brain-inspired computer vision systems. The findings indicate that low- and high-level representations can differentially drive decision-making, and that reducing their correlation in natural stimuli enables clearer inferences about visual processing.

Abstract

In visual decision making, high-level features, such as object categories, have a strong influence on choice. However, the impact of low-level features on behavior is less understood, partly because high- and low-level features are highly correlated in the stimuli typically presented (e.g., objects of the same category are more likely to share low-level features). To disentangle these effects, we propose a method that de-correlates low- and high-level visual properties in a novel set of stimuli. Our method uses two Convolutional Neural Networks (CNNs) as candidate models of the ventral visual stream: CORnet-S, which has high neural predictivity for high-level, IT-like responses, and VGG-16, which has high neural predictivity for low-level responses. Triplets of stimuli (root, image1, image2) are parametrized by the low- and high-level similarity of the images, computed from features extracted from different layers of the networks. These stimuli are then used in a decision-making task in which participants choose the image that is more similar to the root image. We found that the networks differ in their ability to predict the effects of low- versus high-level similarity: while CORnet-S outperforms VGG-16 in explaining human choices based on high-level similarity, VGG-16 outperforms CORnet-S in explaining human choices based on low-level similarity. Using Brain-Score, we observed that the behavioral prediction abilities of different layers of these networks qualitatively corresponded to their ability to explain neural activity at different levels of the visual hierarchy. In summary, our algorithm for stimulus set generation enables the study of how different representations in the visual stream affect high-level cognitive behaviors.
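The triplet parametrization described above can be illustrated with a minimal sketch. It assumes that each image has already been reduced to a feature vector taken from one network layer, and that the score for a triplet is the similarity of image1 to the root minus the similarity of image2 to the root, so that a positive value means image1 is closer to the root. The choice of cosine similarity and the function names `cosine_similarity` and `triplet_score` are illustrative assumptions, not the paper's stated definition.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_score(f_root, f_1, f_2):
    """Hypothetical triplet score: sim(root, image1) - sim(root, image2).

    Positive values mean image1 is more similar to the root than image2.
    Applying this to features from a late layer would give a high-level score,
    and to features from an early layer a low-level score.
    """
    return cosine_similarity(f_root, f_1) - cosine_similarity(f_root, f_2)

# Toy feature vectors standing in for layer activations.
root = np.array([1.0, 0.0])
img1 = np.array([1.0, 0.0])   # identical direction to the root
img2 = np.array([0.0, 1.0])   # orthogonal to the root

d = triplet_score(root, img1, img2)
print(d)
```

Under these assumptions, swapping image1 and image2 flips the sign of the score, which is what allows the same quantity to describe a preference for either image.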

Paper Structure (8 sections, 4 equations, 4 figures, 1 algorithm)

Figures (4)

  • Figure 1: Task Description: On each trial, subjects first made a semantic judgment about a root image, i.e., indoor or outdoor. Shortly afterwards, subjects were asked to indicate which of two images they thought was more similar to the root image.
  • Figure 2: Triplet Examples selected from the CC0-Things dataset (Hebart et al., 2023): (Top-left) $I_{1}$ and $I_{2}$ have the same level of both low- and high-level similarity to $I_{\text{root}}$. (Top-right) $I_{2}$ has higher low-level similarity than $I_{1}$ but the same high-level similarity to $I_{\text{root}}$. (Bottom-left) $I_{2}$ has higher high-level similarity than $I_{1}$ but the same low-level similarity to $I_{\text{root}}$. (Bottom-right) $I_{2}$ has both higher low- and high-level similarity than $I_{1}$ to $I_{\text{root}}$.
  • Figure 3: Behavioral Results: The y-axis represents the probability that a participant selects the left image $I_{1}$. The x-axis represents the high-level similarity of the left image $I_{1}$ versus the right image $I_{2}$, given by the discretized dissimilarity score $D_{\text{high}}(I_{\text{root}}, I_{1}, I_{2})$. Bin 0 corresponds to the lowest (most negative) values and bin 7 to the highest (most positive) values; see Eq. \ref{eq:ITsim} for the definition of $D$. Left > Right refers to the condition $D_{\text{low}}(I_{\text{root}}, I_{1}, I_{2}) > 0$. Right > Left refers to the condition $D_{\text{low}}(I_{\text{root}}, I_{1}, I_{2}) < 0$. Each panel title follows the pattern "network name: high-level layer and low-level layer".
  • Figure 4: Neural Predictivity Score of each model and layer for V2 and IT areas. See Section \ref{subsec:brainscore} for details.
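
The Figure 3 caption describes a discretized score with bins running from 0 (most negative) to 7 (most positive). One way such a discretization could be done is sketched below; it assumes eight equal-width bins spanning the observed range of scores, which is an illustrative choice and not necessarily the binning used in the paper. The function name `discretize_scores` is hypothetical.

```python
import numpy as np

def discretize_scores(d, n_bins=8):
    """Map continuous scores to integer bins 0..n_bins-1.

    Assumption: equal-width bins between the minimum and maximum observed score.
    """
    d = np.asarray(d, dtype=float)
    edges = np.linspace(d.min(), d.max(), n_bins + 1)
    # Use only the interior edges so the minimum lands in bin 0
    # and the maximum lands in bin n_bins - 1.
    return np.digitize(d, edges[1:-1])

scores = np.array([-1.0, -0.1, 0.1, 1.0])
bins = discretize_scores(scores)
print(bins.tolist())  # [0, 3, 4, 7]
```

With an even number of equal-width bins over a range symmetric around zero, negative scores fall in bins 0 to 3 and positive scores in bins 4 to 7, matching the caption's description of bin 0 as negative and bin 7 as positive.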