
Connecting Concept Convexity and Human-Machine Alignment in Deep Neural Networks

Teresa Dorszewski, Lenka Tětková, Lorenz Linhardt, Lars Kai Hansen

TL;DR

The study suggests that the convex regions formed in the latent spaces of neural networks align, to some extent, with human-defined categories and reflect the similarity relations humans use in cognitive tasks.

Abstract

Understanding how neural networks align with human cognitive processes is a crucial step toward developing more interpretable and reliable AI systems. Motivated by theories of human cognition, this study examines the relationship between convexity in neural network representations and human-machine alignment based on behavioral data. We identify a correlation between these two dimensions in pretrained and fine-tuned vision transformer models. Our findings suggest that the convex regions formed in latent spaces of neural networks to some extent align with human-defined categories and reflect the similarity relations humans use in cognitive tasks. While optimizing for alignment generally enhances convexity, increasing convexity through fine-tuning yields inconsistent effects on alignment, which suggests a complex relationship between the two. This study presents a first step toward understanding the relationship between the convexity of latent representations and human-machine alignment.

Paper Structure (16 sections, 2 equations, 3 figures, 9 tables)

Figures (3)

  • Figure 1: Toy example with four potential cases of alignment vs. convexity in representation space. Apples and lemons (or cars and airplanes) should have high cosine similarity to achieve high alignment with humans. All objects should be neighboring objects (gray connection) of the same type for high convexity. In principle, high convexity does not necessarily imply high alignment and vice versa. We investigate which of the scenarios best reflects the representation structure of trained neural networks.
  • Figure 2: Correlation between convexity and human-machine alignment for ViT [dosovitskiy2020vit], BEiT [bao2021beit], and data2vec [baevski2022data2vec] across all layers. Each model is evaluated in its base and large architectures, both pretrained and fine-tuned (ft). a) Graph convexity of the THINGS superclasses across all models. Convexity increases steadily for fine-tuned models and peaks in the middle layers for pretrained models. The dotted line indicates the lower bound (random labeling). b) Human-machine alignment measured by OOOA across all models. OOOA peaks in the middle layers. Dotted lines indicate the lower bound (chance level) and upper bound (inter-human consistency level). c) Correlation between convexity and OOOA for pretrained and fine-tuned models, respectively. There is a strong positive correlation in the first half of the model. Fine-tuned models show a strong negative correlation in late layers.
  • Figure 3: Correlation of OOOA and convexity across models. The correlation is high in the first half of the models (0.91). In the second half it is lower for pretrained models (0.4) and negative for fine-tuned models (-0.54).
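
The captions above measure alignment with OOOA and describe it in terms of cosine similarity between object representations. As a rough illustration, the sketch below assumes OOOA is an odd-one-out accuracy over object triplets: the model keeps the most similar pair together and names the third object as the odd one out, and the score is the fraction of triplets where this matches the human choice. The function name `odd_one_out_accuracy` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def odd_one_out_accuracy(embeddings, triplets, human_choices):
    """Fraction of triplets where the model's odd one out matches the human's.

    Assumption: the model's odd one out is the item left over after the pair
    with the highest cosine similarity has been grouped together.
    """
    # Normalise rows so that dot products equal cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T

    correct = 0
    for (a, b, c), human_odd in zip(triplets, human_choices):
        # Map each candidate odd item to the similarity of the other two items.
        remaining_pair_sim = {c: sim[a, b], b: sim[a, c], a: sim[b, c]}
        model_odd = max(remaining_pair_sim, key=remaining_pair_sim.get)
        correct += int(model_odd == human_odd)
    return correct / len(triplets)


# Toy data (hypothetical): items 0 and 1 point in similar directions, item 2 does not.
toy_embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
toy_triplets = [(0, 1, 2), (0, 1, 2)]
toy_human_choices = [2, 0]  # the second "participant" disagrees with the model

toy_accuracy = odd_one_out_accuracy(toy_embeddings, toy_triplets, toy_human_choices)
print(toy_accuracy)
```

With three candidates per triplet, a random guesser would be right about one third of the time, which is one way to read the chance-level lower bound mentioned in Figure 2b.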

Theorems & Definitions (1)

  • Definition 1: Graph Convexity
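
Only the title of Definition 1 is listed here, so the following is a minimal sketch under stated assumptions rather than the paper's definition. It assumes graph convexity is computed on a k-nearest-neighbor graph of the representations: for every pair of same-class points, take one shortest path and record the fraction of intermediate nodes that share the class, then average over pairs. Scoring directly connected pairs as 1 and disconnected pairs as 0, the choice of `k`, and all function names are assumptions made for illustration.

```python
import heapq

import numpy as np


def knn_graph(points, k):
    """Undirected k-nearest-neighbor graph with Euclidean edge weights."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adj = {i: {} for i in range(len(points))}
    for i in range(len(points)):
        neighbors = [int(j) for j in np.argsort(dists[i]) if j != i][:k]
        for j in neighbors:
            adj[i][j] = float(dists[i, j])
            adj[j][i] = float(dists[i, j])
    return adj


def shortest_path(adj, source, target):
    """Dijkstra's algorithm; returns one shortest path as a node list, or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(heap, (d + w, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def graph_convexity(points, labels, k=2):
    """Mean same-class fraction of interior nodes on same-class shortest paths."""
    adj = knn_graph(points, k)
    scores = []
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            if labels[a] != labels[b]:
                continue
            path = shortest_path(adj, a, b)
            if path is None:
                scores.append(0.0)  # assumption: disconnected pairs score 0
            elif len(path) == 2:
                scores.append(1.0)  # assumption: directly connected pairs score 1
            else:
                interior = path[1:-1]
                scores.append(sum(labels[m] == labels[a] for m in interior) / len(interior))
    return float(np.mean(scores)) if scores else float("nan")


# Toy data (hypothetical): two well-separated clusters, then an interleaved line.
separated = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]])
separated_score = graph_convexity(separated, [0, 0, 0, 1, 1, 1], k=2)

interleaved = np.array([[0.0], [1.0], [2.1]])
interleaved_score = graph_convexity(interleaved, ["A", "B", "A"], k=1)
print(separated_score, interleaved_score)
```

In the interleaved toy case the only path between the two "A" points runs through a "B" point, so the class is not convex in this graph sense, whereas the well-separated clusters are fully convex.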