Comparing Abstraction in Humans and Large Language Models Using Multimodal Serial Reproduction

Sreejan Kumar; Raja Marjieh; Byron Zhang; Declan Campbell; Michael Y. Hu; Umang Bhatt; Brenden Lake; Thomas L. Griffiths

TL;DR

The paper investigates how language influences abstraction formation by extending serial reproduction to a multimodal setting that alternates between vision and language. It frames the process as a Markov chain over world states and abstractions with cross-modal priors $p_S(\mu)$ and $p_L(\mu)$, and analyzes how modality affects the stationary distribution of stimuli and the decodability of board complexity from language. Empirically, language transmission markedly reshapes human abstractions, while GPT-4 shows a closer coupling between vision and language priors, likely due to training on image–text data. This work advances cross-modal cognitive probing methods and has implications for evaluating AI systems and aligning them with human-like abstraction processes.
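
The Markov-chain framing above can be illustrated with a toy unimodal chain. This is a minimal sketch under stated assumptions, not the paper's model: the three abstractions, four stimuli, and all probabilities below are hypothetical, and each simulated participant is assumed to infer an abstraction from the posterior $p(\mu \mid x)$ and then reproduce a stimulus from $p(x' \mid \mu)$. Under those assumptions the chain's stationary distribution over stimuli is the prior predictive $\sum_\mu p(x \mid \mu)\,p(\mu)$, which the code checks numerically.

```python
# Toy unimodal serial reproduction chain (all values are hypothetical).
PRIOR = [0.6, 0.3, 0.1]            # assumed prior p(mu) over 3 abstractions
LIKELIHOOD = [                     # assumed p(x | mu) over 4 stimuli; rows sum to 1
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]


def prior_predictive(prior, lik):
    """p(x) = sum_mu p(x | mu) p(mu)."""
    n_x = len(lik[0])
    return [sum(prior[m] * lik[m][x] for m in range(len(prior))) for x in range(n_x)]


def transition_matrix(prior, lik):
    """T[x][x2] = sum_mu p(x2 | mu) p(mu | x): perceive x, infer mu, reproduce x2."""
    px = prior_predictive(prior, lik)
    n_mu, n_x = len(prior), len(px)
    rows = []
    for x in range(n_x):
        posterior = [prior[m] * lik[m][x] / px[x] for m in range(n_mu)]
        rows.append(
            [sum(posterior[m] * lik[m][x2] for m in range(n_mu)) for x2 in range(n_x)]
        )
    return rows


def run_chain(T, start, steps):
    """Push a distribution over stimuli through `steps` reproductions."""
    dist = list(start)
    n = len(dist)
    for _ in range(steps):
        dist = [sum(dist[x] * T[x][x2] for x in range(n)) for x2 in range(n)]
    return dist


T = transition_matrix(PRIOR, LIKELIHOOD)
final = run_chain(T, [1.0, 0.0, 0.0, 0.0], 200)   # chain seeded at stimulus 0
target = prior_predictive(PRIOR, LIKELIHOOD)
print(final)
print(target)
```

The two printed distributions should agree closely regardless of the seed stimulus, which is why the stationary distribution of a chain can be read as a probe of the participants' prior. A multimodal chain would alternate two such steps with modality-specific priors; that extension is not sketched here.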

Abstract

Humans extract useful abstractions of the world from noisy sensory data. Serial reproduction allows us to study how people construe the world through a paradigm similar to the game of telephone, where one person observes a stimulus and reproduces it for the next to form a chain of reproductions. Past serial reproduction experiments typically employ a single sensory modality, but humans often communicate abstractions of the world to each other through language. To investigate the effect of language on the formation of abstractions, we implement a novel multimodal serial reproduction framework by asking people who receive a visual stimulus to reproduce it in a linguistic format, and vice versa. We ran unimodal and multimodal chains with both humans and GPT-4 and found that adding language as a modality has a larger effect on human reproductions than on GPT-4's. This suggests human visual and linguistic representations are more dissociable than those of GPT-4.

Paper Structure (15 sections, 5 figures, 2 tables)

This paper contains 15 sections, 5 figures, 2 tables.

Introduction
Methods
Theoretical Framework
Human Experiments
Machine Experiments
Measures of Board Complexity
Results
Qualitative Board Distribution
Chain Dynamics
Board Complexity Analyses
Decoding Board Complexity from Language
Discussion
Acknowledgements
Appendix
GPT4v Prompts

Figures (5)

Figure 1: Example Multimodal Serial Reproduction Chain in Humans. One participant sees a stimulus and transmits a language description of the stimulus. The next participant sees a language description and produces a stimulus matching the description. The chain alternates between vision and language.
Figure 2: Serial Reproduction Chains Across Modalities. Five example human and GPT-4 chains for each paradigm.
Figure 3: Most Frequent Boards Across Conditions. Each number indicates the frequency of the board below it.
Figure 4: Mean Chain Velocity. We computed the mean instantaneous velocity of each chain as the Hamming distance traveled between boards at consecutive timesteps. Error bars denote 95% confidence intervals across chains.
Figure 5: Transmitting through language has a larger effect on humans than on GPT-4. (A) 95% confidence intervals for complexity measures across humans and GPT-4 for both types of chains. GPT-4 boards typically have higher complexity. Multimodal serial reproduction typically reduces complexity, and this reduction is more pronounced in humans than in GPT-4. (B) Decoding ($R^{2}$) performance for predicting board complexity from the corresponding language description's sentence embeddings. Higher performance suggests that the complexity of the boards can be represented in language. Decoding performance increases from unimodal to multimodal chains, and GPT-4 boards have higher decoding performance than human boards.
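
The chain velocity described in the Figure 4 caption can be sketched as follows. This is an illustration under assumptions rather than the authors' code: boards are represented here as flat lists of 0/1 cells, and the names `hamming` and `mean_velocity` and the four-cell example chain are hypothetical.

```python
def hamming(board_a, board_b):
    """Number of cells that differ between two equally sized boards."""
    if len(board_a) != len(board_b):
        raise ValueError("boards must have the same number of cells")
    return sum(a != b for a, b in zip(board_a, board_b))


def mean_velocity(chain):
    """Mean Hamming distance between boards at consecutive timesteps."""
    if len(chain) < 2:
        raise ValueError("a chain needs at least two boards")
    steps = [hamming(chain[t], chain[t + 1]) for t in range(len(chain) - 1)]
    return sum(steps) / len(steps)


# Hypothetical chain of four boards, each flattened to four binary cells.
example_chain = [
    [1, 0, 0, 1],
    [1, 1, 0, 1],  # 1 cell changed
    [0, 1, 1, 1],  # 2 cells changed
    [0, 1, 1, 1],  # 0 cells changed
]
print(mean_velocity(example_chain))  # (1 + 2 + 0) / 3 = 1.0
```

A velocity near zero indicates that reproductions have stopped changing, so comparing mean velocity across conditions gives a rough sense of how quickly each kind of chain settles.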
