Beyond Accuracy: Uncovering the Role of Similarity Perception and its Alignment with Semantics in Supervised Learning

Katarzyna Filus, Mateusz Żarski

TL;DR

The paper examines how deep vision models develop human-like notions of similarity, looking beyond accuracy. It introduces the Deep Similarity Inspector (DSI), which compares three kinds of class similarity matrices: Network Class Similarity Matrices (NCSM), Confusion-based Class Similarity Matrices (CCSM), and Semantic Class Similarity Matrices (SCSM), with WordNet path similarity as the semantic reference. It tracks these over training with metrics such as the Similarity Alignment Index (SAI), the Inverse Dissimilarity Metric (IDM), and the Weights Similarity Index (WSI). Experiments with CNNs and ViTs on Mini-ImageNet (and CIFAR100) reveal a three-phase emergence of similarity perception: initial surge, refinement, and stabilization. CNNs align with semantics faster than ViTs, and hybrid architectures show traits of both. The framework offers insights for explainable AI and model selection, is presented as generalizable to other tasks and modalities, and could inform training-time regularization or loss-function design that exploits similarity structure.

Abstract

Similarity manifests in various forms. Semantic similarity is particularly important because it approximates human object categorization based on, e.g., shared functionalities and evolutionary traits. It also offers practical advantages in computational modeling, since lexical structures such as WordNet provide constant and interpretable similarity values. In deep vision, however, the emergence of similarity perception has received too little attention. We introduce the Deep Similarity Inspector (DSI), a systematic framework for inspecting how deep vision networks develop their similarity perception and how it aligns with semantic similarity. Our experiments show that both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) develop a rich similarity perception during training in three phases (initial similarity surge, refinement, stabilization), with clear differences between CNNs and ViTs. Besides the gradual elimination of mistakes, a mistake refinement phenomenon can be observed.
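
To make the idea of comparing a network's similarity perception against a semantic reference more concrete, here is a minimal sketch. It is not the authors' implementation: this summary does not define how NCSM or SAI are computed, so the sketch assumes cosine similarity between per-class weight vectors as the network-side similarity and a plain Pearson correlation over class pairs as a stand-in alignment score. The taxonomy, class names, and weight values are all made up for illustration. The one part taken from a known definition is path similarity, 1 / (1 + shortest is-a path length), which is how WordNet path similarity is commonly defined.

```python
import math
from itertools import combinations

# Toy is-a taxonomy (hypothetical; stands in for WordNet hypernym links).
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}

def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest is-a path between a and b)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    depth_in_b = {n: i for i, n in enumerate(chain_b)}
    for i, n in enumerate(chain_a):
        if n in depth_in_b:  # lowest common ancestor when walking up from a
            return 1.0 / (1 + i + depth_in_b[n])
    raise ValueError("nodes share no common ancestor")

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

classes = ["dog", "cat", "car"]
# Made-up per-class weight vectors (assumption: one vector per class).
weights = {
    "dog": [1.0, 0.9, 0.1],
    "cat": [0.9, 1.0, 0.0],
    "car": [0.0, 0.1, 1.0],
}

# Compare the two similarity notions over all unordered class pairs.
pairs = list(combinations(classes, 2))
semantic = [path_similarity(a, b) for a, b in pairs]
network = [cosine(weights[a], weights[b]) for a, b in pairs]
alignment = pearson(semantic, network)  # stand-in score, not the paper's SAI

print(dict(zip(pairs, semantic)))
print(dict(zip(pairs, network)))
print("alignment:", alignment)
```

Recomputing such a score at every epoch would give a curve over training, which is the kind of time-resolved view the TL;DR describes. The metrics the paper actually uses may differ from this sketch.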

Paper Structure

This paper contains 34 sections, 2 equations, 38 figures, 3 tables.

Figures (38)

  • Figure 1: Mini-ImageNet: Weights Similarity Index (WSI).
  • Figure 2: Mini-ImageNet: SAI(NCSM, SCSM) Curves.
  • Figure 3: Mini-ImageNet: SAI(NCSM, CCSM) Curves.
  • Figure 4: Mini-ImageNet: NCSMs and CCSMs of ResNet18 and SwinV2 (epoch number in brackets).
  • Figure 5: Mini-ImageNet: Network-based IDM.
  • ...and 33 more figures