Perceptual misalignment of texture representations in convolutional neural networks

Ludovica de Paolis, Fabio Anselmi, Alessio Ansuini, Eugenio Piasini

Abstract

Mathematical modeling of visual textures traces back to Julesz's intuition that texture perception in humans is based on local correlations between image features. An influential approach for texture analysis and generation generalizes this notion to linear correlations between the nonlinear features computed by convolutional neural networks (CNNs), compiled into Gram matrices. Given that CNNs are often used as models for the visual system, it is natural to ask whether such "texture representations" spontaneously align with the textures' perceptual content, and in particular whether those CNNs that are regarded as better models for the visual system also possess more human-like texture representations. Here we compare the perceptual content captured by feature correlations computed for a diverse pool of CNNs, and we compare it to the models' perceptual alignment with the mammalian visual system as measured by Brain-Score. Surprisingly, we find that there is no connection between conventional measures of CNN quality as a model of the visual system and its alignment with human texture perception. We conclude that texture perception involves mechanisms that are distinct from those that are commonly modeled using approaches based on CNNs trained on object recognition, possibly depending on the integration of contextual information.
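The texture representation described here (linear correlations between a layer's nonlinear feature maps, collected into a Gram matrix) can be sketched in a few lines. This is a minimal NumPy illustration following the usual convention from Gatys-style texture synthesis; the function name and the normalization by the number of spatial positions are our own choices for the sketch, not the authors' implementation.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a convolutional layer's feature maps.

    features: array of shape (C, H, W), i.e. C feature maps over an
    H x W spatial grid. Returns a (C, C) matrix whose (i, j) entry is
    the inner product of flattened maps i and j, normalized by the
    number of spatial positions.
    """
    c, h, w = features.shape
    f = features.reshape(c, h * w)   # flatten the spatial dimensions
    return f @ f.T / (h * w)         # feature-feature correlations

# Toy example: 3 feature maps on a 4x4 grid.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4, 4))
g = gram_matrix(feats)
print(g.shape)  # (3, 3), symmetric by construction
```

Because the spatial dimensions are summed out, the Gram matrix discards where features occur and keeps only how strongly they co-occur, which is what makes it a natural texture statistic.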

Paper Structure

This paper contains 21 sections, 20 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Representational Dissimilarity Matrices (5640x5640) obtained with Representational Similarity Analysis from the 5 layers analyzed for VGG-19. Each entry corresponds to the distance between the Gram matrix representation of one pair of images in DTD. The distance between representations is computed using cosine similarity.
  • Figure 2: MI values across CNN layers identified by index (1-5). Each line corresponds to one of the 13 models as color-coded in the legend.
  • Figure 3: Correlation plots with Pearson's ρ. Each subplot shows the Brain-Score average vision, neural vision, behavior vision, V1, V2, V4, and IT scores, respectively, vs. the layer with the highest MI for each of the 13 CNNs. Each color identifies a model as reported in the legend.
  • Figure 4: Textures generated with the Gatys algorithm applied to a subsample of CNNs (rows). The 4 original images (top row) belong to 4 different classes in DTD (columns: blotchy; striped; matted; scaly). The last column shows one of the images originally used by Gatys et al., which we synthesized as a reference. Textures generated by the whole pool of 13 models can be found in the Appendix (Supplementary Figures \ref{fig:synthesized_all_1} and \ref{fig:synthesized_all_2}).
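The representational dissimilarity matrices of Figure 1 can be sketched as follows: given one flattened Gram-matrix representation per image, each RDM entry is one minus the cosine similarity of a pair of representations. This is an illustrative reconstruction assuming cosine distance as the dissimilarity; `cosine_rdm` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def cosine_rdm(vectors):
    """Representational dissimilarity matrix from cosine similarity.

    vectors: (N, D) array, one flattened Gram-matrix representation
    per image. Returns an (N, N) matrix whose (i, j) entry is
    1 - cosine_similarity(vectors[i], vectors[j]), so identical
    directions give 0 and orthogonal ones give 1.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms           # project onto the unit sphere
    sim = unit @ unit.T              # pairwise cosine similarities
    return 1.0 - sim

# Toy example with three 2-D "representations".
reps = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
rdm = cosine_rdm(reps)
```

For a dataset the size of DTD this yields the 5640 x 5640 matrices shown in Figure 1, one per analyzed layer.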