Explaining Representation Learning with Perceptual Components

Yavuz Yarici, Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib

TL;DR

The paper tackles the interpretability challenge of self-supervised representations by introducing a perceptual-component framework that analyzes color, shape, and texture through pixel-level importance maps obtained via selective masking. By measuring representation shifts with cosine similarity and using component-specific masking pipelines, the method yields intuitive explanations that persist even when labels are unavailable. The authors demonstrate that different training objectives (Supervised, SimCLR, VICReg, Barlow Twins) and image domains lead to distinct emphasis across perceptual components, with concrete findings such as texture being highly informative for birds and color for flowers. This approach advances explainability in representation learning by aligning explanations with human visual perception and enabling domain-aware analysis of learned spaces.

Abstract

Self-supervised models create representation spaces that lack clear semantic meaning. This interpretability problem of representations makes traditional explainability methods ineffective in this context. In this paper, we introduce a novel method to analyze representation spaces using three key perceptual components: color, shape, and texture. We employ selective masking of these components to observe changes in representations, resulting in distinct importance maps for each. In scenarios where labels are absent, these importance maps provide more intuitive explanations because the components are integral to the human visual system. Our approach enhances the interpretability of the representation space, offering explanations that resonate with human visual perception. We analyze how different training objectives create distinct representation spaces using perceptual components. Additionally, we examine the representation of images across diverse image domains, providing insights into the role of these components in different contexts.
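
The masking idea in the abstract can be illustrated with a minimal sketch: mask one region at a time, re-encode the image, and record how far the representation moves, measured as one minus the cosine similarity. Everything named here is an assumption made for illustration, not the authors' implementation: `toy_encoder` is a random projection standing in for a pre-trained encoder, and patch-wise mean filling stands in for the paper's component-specific masking.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    # Cosine of the angle between two representation vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def toy_encoder(image, weights):
    # Hypothetical stand-in for a pre-trained encoder: a fixed random
    # projection of the flattened image followed by a nonlinearity.
    return np.tanh(weights @ image.reshape(-1))

def importance_map(image, encode, patch=4):
    # Assumed masking scheme: fill one patch at a time with the image's
    # mean color and score the patch by the resulting representation shift.
    h, w = image.shape[:2]
    reference = encode(image)
    fill = image.mean(axis=(0, 1))
    scores = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            masked = image.copy()
            masked[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = fill
            # A larger shift (lower similarity) means a more important patch.
            scores[i, j] = 1.0 - cosine_similarity(reference, encode(masked))
    return scores

rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))                    # synthetic 16x16 RGB image
weights = rng.standard_normal((32, 16 * 16 * 3))   # 32-dimensional toy representation
scores = importance_map(image, lambda x: toy_encoder(x, weights))
print(scores.shape)
```

No labels enter the computation: the score depends only on the encoder output before and after masking, which is why this style of explanation remains available for self-supervised models.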

Paper Structure (12 sections, 9 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: This figure shows the overall importance score and the importance scores for Color, Shape, and Texture for a SimCLR encoder pre-trained on ImageNet. ResNet50 is used as the backbone of the encoder. The image is taken from ImageNet.
  • Figure 2: This diagram illustrates how the importance maps for Color, Shape, and Texture are produced. Circles represent the unmasked images and rectangles represent the masked images. Cosine similarities between the unmasked and masked images are used to generate the importance map for each component.
  • Figure 3: This figure shows the overall importance scores and the importance scores for Color, Shape, and Texture for Supervised, SimCLR, Barlow Twins, and VICReg models. Red indicates high values and blue indicates low values.
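
The per-component comparison described in the Figure 2 caption can be sketched at the whole-image level as follows. The captions do not specify the masking operations, so the three removal functions below are illustrative stand-ins chosen for this sketch (grayscale conversion for color, patch shuffling for shape, box blurring for texture), and `encode` is a hypothetical random-projection encoder rather than any of the models named above.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def remove_color(image):
    # Assumed color masking: replace every channel with the per-pixel mean.
    gray = image.mean(axis=2, keepdims=True)
    return np.repeat(gray, 3, axis=2)

def remove_texture(image, k=3):
    # Assumed texture masking: k x k box blur with edge padding.
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = image.shape[:2]
    out = np.zeros_like(image, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def remove_shape(image, patch=4, seed=0):
    # Assumed shape masking: shuffle patches so the global layout is lost
    # while local pixel statistics are kept. Height and width are assumed
    # to be divisible by `patch`.
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    tiles = image.reshape(gh, patch, gw, patch, c).swapaxes(1, 2)
    tiles = tiles.reshape(gh * gw, patch, patch, c)
    order = np.random.default_rng(seed).permutation(gh * gw)
    shuffled = tiles[order].reshape(gh, gw, patch, patch, c).swapaxes(1, 2)
    return shuffled.reshape(h, w, c)

def component_scores(image, encode):
    # Score each component by the representation shift its removal causes.
    reference = encode(image)
    removals = {"color": remove_color, "shape": remove_shape, "texture": remove_texture}
    return {name: 1.0 - cosine_similarity(reference, encode(fn(image)))
            for name, fn in removals.items()}

rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))
weights = rng.standard_normal((32, 16 * 16 * 3))
encode = lambda x: np.tanh(weights @ x.reshape(-1))  # hypothetical encoder
scores = component_scores(image, encode)
print(sorted(scores))
```

Running the same function with encoders trained under different objectives, or on images from different domains, is the kind of comparison Figure 3 reports. The numbers produced by this toy encoder carry no meaning beyond showing the mechanics.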