Comparing Abstraction in Humans and Large Language Models Using Multimodal Serial Reproduction

Sreejan Kumar, Raja Marjieh, Byron Zhang, Declan Campbell, Michael Y. Hu, Umang Bhatt, Brenden Lake, Thomas L. Griffiths

TL;DR

The paper investigates how language influences abstraction formation by extending serial reproduction to a multimodal setting that alternates between vision and language. It frames the process as a Markov chain over world states and abstractions with cross-modal priors $p_S(\mu)$ and $p_L(\mu)$, and analyzes how modality affects the stationary distribution of stimuli and how well board complexity can be decoded from language. Empirically, language transmission markedly reshapes human abstractions, while GPT-4 shows a closer coupling between vision and language priors, likely due to training on image–text data. This work advances cross-modal cognitive probing methods and has implications for evaluating and aligning AI systems with human-like abstraction processes.
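
The Markov-chain framing can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: three discrete world states, a shared noisy-observation likelihood, agents that reproduce by sampling from their Bayesian posterior, and a multimodal chain modeled simply as alternating a kernel built from $p_S(\mu)$ with one built from $p_L(\mu)$. Under these assumptions a unimodal chain converges to its own prior, while the alternating chain settles on a distribution shaped by both priors.

```python
# Toy sketch (all numbers hypothetical): serial reproduction as a Markov chain.

def posterior_kernel(prior, likelihood):
    """T[i][j] = sum_x p(x | mu_i) * p(mu_j | x) for a posterior-sampling agent."""
    n = len(prior)
    n_obs = len(likelihood[0])
    # Marginal probability of each observation: p(x) = sum_k p(x | mu_k) p(mu_k)
    evidence = [sum(likelihood[k][x] * prior[k] for k in range(n))
                for x in range(n_obs)]
    return [
        [sum(likelihood[i][x] * likelihood[j][x] * prior[j] / evidence[x]
             for x in range(n_obs))
         for j in range(n)]
        for i in range(n)
    ]

def step(dist, kernel):
    """Push a distribution over states through one reproduction step."""
    n = len(dist)
    return [sum(dist[i] * kernel[i][j] for i in range(n)) for j in range(n)]

def stationary(kernels, n_iter=2000):
    """Power iteration, cycling through the given kernels in order."""
    n = len(kernels[0])
    dist = [1.0 / n] * n
    for _ in range(n_iter):
        for kernel in kernels:
            dist = step(dist, kernel)
    return dist

# Hypothetical priors over three world states and a noisy-identity likelihood.
p_S = [0.6, 0.3, 0.1]   # assumed prior for the visual modality
p_L = [0.2, 0.3, 0.5]   # assumed prior for the language modality
likelihood = [[0.8 if x == m else 0.1 for x in range(3)] for m in range(3)]

T_S = posterior_kernel(p_S, likelihood)
T_L = posterior_kernel(p_L, likelihood)

print("unimodal (vision):", [round(v, 3) for v in stationary([T_S])])
print("multimodal:       ", [round(v, 3) for v in stationary([T_S, T_L])])
```

The unimodal result matches `p_S` because a posterior-sampling kernel satisfies detailed balance with respect to its prior. The multimodal result is the distribution observed after each language step of the alternating chain.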

Abstract

Humans extract useful abstractions of the world from noisy sensory data. Serial reproduction allows us to study how people construe the world through a paradigm similar to the game of telephone, where one person observes a stimulus and reproduces it for the next to form a chain of reproductions. Past serial reproduction experiments typically employ a single sensory modality, but humans often communicate abstractions of the world to each other through language. To investigate the effect of language on the formation of abstractions, we implement a novel multimodal serial reproduction framework by asking people who receive a visual stimulus to reproduce it in a linguistic format, and vice versa. We ran unimodal and multimodal chains with both humans and GPT-4 and find that adding language as a modality has a larger effect on human reproductions than on GPT-4's. This suggests human visual and linguistic representations are more dissociable than those of GPT-4.

Paper Structure

This paper contains 15 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Example Multimodal Serial Reproduction Chain in Humans. One participant sees a stimulus and transmits a language description of the stimulus. The next participant sees a language description and produces a stimulus matching the description. The chain alternates between vision and language.
  • Figure 2: Serial Reproduction Chains Across Modalities. Five example human and GPT-4 chains for each paradigm.
  • Figure 3: Most Frequent Boards Across Conditions. Numbers indicate the frequency of the board below it.
  • Figure 4: Mean Chain Velocity. We computed the mean instantaneous velocity of each chain as the Hamming distance traveled between boards at consecutive timesteps. Error bars denote 95% confidence intervals across chains.
  • Figure 5: Transmitting through language has a larger effect on humans than GPT-4. (A) 95% confidence intervals for complexity measures across humans and GPT-4 for both types of chains. GPT-4 boards typically have higher complexity. Multimodal serial reproduction typically reduces complexity, and this reduction is more pronounced in humans than in GPT-4. (B) Decoding performance ($R^{2}$) for predicting board complexity from sentence embeddings of the corresponding language descriptions. Higher performance suggests that the complexity of the boards can be represented in language. Decoding performance increases from unimodal to multimodal chains, and GPT-4 boards have higher decoding performance than human boards.
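
The chain-velocity measure described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each board is a flat sequence of binary cells, and the function names and the example chain are hypothetical.

```python
# Minimal sketch: mean instantaneous velocity of one reproduction chain,
# measured as the Hamming distance between consecutive boards.

def hamming(board_a, board_b):
    """Number of cells that differ between two equally sized boards."""
    if len(board_a) != len(board_b):
        raise ValueError("boards must have the same number of cells")
    return sum(a != b for a, b in zip(board_a, board_b))

def mean_velocity(chain):
    """Average Hamming distance traveled per timestep along a chain."""
    if len(chain) < 2:
        raise ValueError("a chain needs at least two boards")
    distances = [hamming(prev, nxt) for prev, nxt in zip(chain, chain[1:])]
    return sum(distances) / len(distances)

# Hypothetical chain of four 2x2 boards, flattened row by row.
chain = [
    (0, 0, 0, 0),
    (1, 0, 0, 0),  # 1 cell changed
    (1, 1, 0, 1),  # 2 cells changed
    (1, 1, 0, 1),  # 0 cells changed
]
print(mean_velocity(chain))  # 1.0
```

Computing this value per chain and then summarizing across chains would give the kind of per-condition means and confidence intervals the caption refers to.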