Comparing Abstraction in Humans and Large Language Models Using Multimodal Serial Reproduction
Sreejan Kumar, Raja Marjieh, Byron Zhang, Declan Campbell, Michael Y. Hu, Umang Bhatt, Brenden Lake, Thomas L. Griffiths
TL;DR
The paper investigates how language influences abstraction formation by extending serial reproduction to a multimodal setting that alternates between vision and language. It frames the process as a Markov chain over world states and abstractions with cross-modal priors $p_S(\mu)$ and $p_L(\mu)$, and analyzes how modality affects the stationary distribution of stimuli and how well board complexity can be decoded from language. Empirically, language transmission markedly reshapes human abstractions, while GPT-4 shows a closer coupling between vision and language priors, likely due to training on image–text data. This work advances cross-modal cognitive probing methods and has implications for evaluating AI systems and aligning them with human-like abstraction processes.
Abstract
Humans extract useful abstractions of the world from noisy sensory data. Serial reproduction allows us to study how people construe the world through a paradigm similar to the game of telephone, where one person observes a stimulus and reproduces it for the next to form a chain of reproductions. Past serial reproduction experiments typically employ a single sensory modality, but humans often communicate abstractions of the world to each other through language. To investigate the effect of language on the formation of abstractions, we implement a novel multimodal serial reproduction framework by asking people who receive a visual stimulus to reproduce it in a linguistic format, and vice versa. We ran unimodal and multimodal chains with both humans and GPT-4 and found that adding language as a modality has a larger effect on human reproductions than on GPT-4's. This suggests human visual and linguistic representations are more dissociable than those of GPT-4.
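To make the chain-of-reproductions idea concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes idealized Bayesian participants, a hypothetical three-state world, and a simple symmetric noise model; the names `likelihood`, `reproduce`, and `run_chain` are illustrative. Each participant infers a latent state $\mu$ from a noisy stimulus and generates a new stimulus from it. Under these assumptions, a unimodal chain acts as a Gibbs sampler whose inferred states are distributed according to the participants' prior.

```python
import random
from collections import Counter


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def likelihood(x, mu, n_states, noise):
    """Assumed noise model: the stimulus matches the latent state with
    probability 1 - noise, otherwise it is uniform over the other states."""
    return 1.0 - noise if x == mu else noise / (n_states - 1)


def reproduce(x, prior, noise, rng):
    """One participant: infer mu from stimulus x, then emit a new stimulus."""
    n = len(prior)
    states = range(n)
    # Posterior over latent states: p(mu | x) proportional to p(x | mu) p(mu).
    posterior = normalize([likelihood(x, mu, n, noise) * prior[mu] for mu in states])
    mu = rng.choices(states, weights=posterior)[0]
    # The reproduction is a noisy rendering of the inferred state.
    x_new = rng.choices(states, weights=[likelihood(xp, mu, n, noise) for xp in states])[0]
    return mu, x_new


def run_chain(priors, steps, noise=0.4, seed=0):
    """Run a chain in which participant t uses priors[t % len(priors)].

    One prior gives a unimodal chain; two priors (standing in for p_S and
    p_L) give a chain that alternates between modalities.
    Returns the empirical distribution of inferred latent states."""
    rng = random.Random(seed)
    n = len(priors[0])
    x = rng.randrange(n)  # arbitrary seed stimulus
    counts = Counter()
    for t in range(steps):
        mu, x = reproduce(x, priors[t % len(priors)], noise, rng)
        counts[mu] += 1
    return [counts[m] / steps for m in range(n)]


if __name__ == "__main__":
    p_vision = [0.6, 0.3, 0.1]    # hypothetical visual prior
    p_language = [0.2, 0.3, 0.5]  # hypothetical linguistic prior
    print("unimodal:  ", run_chain([p_vision], steps=50000))
    print("multimodal:", run_chain([p_vision, p_language], steps=50000))
```

In the unimodal run, the empirical distribution should approach the single prior. In the alternating run, the long-run distribution in general need not match either prior, which is one way to read the comparison between unimodal and multimodal chains.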
