How Well Do Deep Learning Models Capture Human Concepts? The Case of the Typicality Effect

Siddhartha K. Vemuri; Raj Sanjay Shah; Sashank Varma

TL;DR

The paper investigates whether deep learning representations capture human typicality judgments, surveying language, vision, and multimodal models across $N=27$ concepts. It computes exemplar–prototype cosine similarities and assesses their alignment with human typicality ratings via Spearman correlations, comparing single-modality and combined models, including a CLIP-based multimodal approach. Language-model embeddings align with human typicality better than vision models do, combining language and vision predictions yields the strongest correspondence (e.g., a correlation of ~0.50 with AlexNet+MiniLM), and CLIP-based multimodal representations also show promise. The work advances cognitively aligned modeling in ML, contributes methodological resources, and highlights both the potential and the limits of current multimodal representations for capturing human concepts.
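
The evaluation described above can be illustrated with a minimal sketch: score each exemplar by its cosine similarity to a category prototype, then compare the resulting ordering with human ratings using a Spearman correlation. This is not the authors' code. The two-dimensional vectors, the prototype, and the ratings below are made-up toy values, and the function names are hypothetical; how the paper actually constructs prototypes is not stated in this summary.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (assumed, not from the paper): a "Bird" prototype and three exemplars.
prototype = [1.0, 0.0]
exemplars = {
    "robin": [0.9, 0.1],
    "sparrow": [0.8, 0.3],
    "penguin": [0.2, 0.9],
}
human_ratings = {"robin": 6.9, "sparrow": 6.5, "penguin": 3.0}  # invented ratings

names = sorted(exemplars)
model_typicality = [cosine_similarity(exemplars[n], prototype) for n in names]
human = [human_ratings[n] for n in names]
rho = spearman(model_typicality, human)
print(f"Spearman correlation for this toy category: {rho:.2f}")
```

In the study this kind of per-category correlation is averaged across categories; the sketch covers a single category only.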

Abstract

How well do representations learned by ML models align with those of humans? Here, we consider concept representations learned by deep learning models and evaluate whether they show a fundamental behavioral signature of human concepts, the typicality effect. This is the finding that people judge some instances (e.g., robin) of a category (e.g., Bird) to be more typical than others (e.g., penguin). Recent research looking for human-like typicality effects in language and vision models has focused on models of a single modality, tested only a small number of concepts, and found only modest correlations with human typicality ratings. The current study expands this behavioral evaluation of models by considering a broader range of language (N = 8) and vision (N = 10) model architectures. It also evaluates whether the combined typicality predictions of vision + language model pairs, as well as a multimodal CLIP-based model, are better aligned with human typicality judgments than those of models of either modality alone. Finally, it evaluates the models across a broader range of concepts (N = 27) than prior studies. There were three important findings. First, language models better align with human typicality judgments than vision models. Second, combined language and vision models (e.g., AlexNet + MiniLM) better predict the human typicality data than the best-performing language model (i.e., MiniLM) or vision model (i.e., ViT-Huge) alone. Third, multimodal models (i.e., CLIP ViT) show promise for explaining human typicality judgments. These results advance the state of the art in aligning the conceptual representations of ML models and humans. A methodological contribution is the creation of a new image set for testing the conceptual alignment of vision models.

Paper Structure

This paper contains 28 sections, 2 figures, 3 tables.

Introduction
The Typicality Effect
Typicality in ML Models
Typicality in Language Models
Typicality in Vision Models
Research Goals
Methods
Data Preparation
Human Typicality Ratings
Image Collection and Processing
Model Selection
Language Models
Vision Models
Multimodal Models
Task Paradigms
...and 13 more sections

Figures (2)

Figure 1: For each (language, vision) model combination, the Spearman correlation between its predicted typicalities and the human typicalities, averaged across all categories.
Figure 2: Beta weights of the linear models predicting the typicalities of each category for the best-performing combined model (AlexNet + MiniLM).
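
To make the "beta weights" in Figure 2 concrete, here is a minimal sketch of a two-predictor linear model that combines a language model's and a vision model's typicality predictions for one category. It uses the standard closed form for standardized regression coefficients with two predictors. The fitting procedure actually used in the paper is not described in this summary, so this is one plausible reading rather than the authors' method; the function name and all numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def standardized_betas(lang_pred, vis_pred, human):
    """Standardized OLS coefficients for human ~ lang_pred + vis_pred.

    With all variables z-scored, the normal equations reduce to
    [[1, r12], [r12, 1]] @ [b_lang, b_vis] = [r1y, r2y].
    Requires the two predictors not to be perfectly correlated.
    """
    r12 = pearson(lang_pred, vis_pred)
    r1y = pearson(lang_pred, human)
    r2y = pearson(vis_pred, human)
    denom = 1.0 - r12 ** 2
    beta_lang = (r1y - r12 * r2y) / denom
    beta_vis = (r2y - r12 * r1y) / denom
    return beta_lang, beta_vis


# Toy predictions for four exemplars of one category (invented values).
lang_pred = [0.9, 0.7, 0.4, 0.2]    # e.g., language-model similarities
vis_pred = [0.8, 0.3, 0.6, 0.1]     # e.g., vision-model similarities
human = [6.8, 5.9, 4.0, 2.5]        # invented human typicality ratings

beta_lang, beta_vis = standardized_betas(lang_pred, vis_pred, human)
print(f"beta (language) = {beta_lang:.2f}, beta (vision) = {beta_vis:.2f}")
```

A larger beta for one modality would indicate that, for that category, its predictions carry more of the weight in explaining the human ratings.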
