Quantification and object perception in Multimodal Large Language Models deviate from human linguistic cognition

Raquel Montero, Natalia Moskvina, Paolo Morosi, Tamara Serrano, Elena Pagliarini, Evelina Leivada

TL;DR

This study investigates why Multimodal Large Language Models (MLLMs) struggle with quantification. It grounds the problem in Generalized Quantifier Theory and tests three cross-linguistically shared features (scale ordering, ranges of use and prototypicality, and approximate-number biases) across six languages. Two tasks compare humans with a reasoning model (o4-mini) and a non-reasoning model (GPT-4o): a Production Task that links visual numerosity to quantifiers, and an Embeddings Task that examines internal representations via cosine similarity. Humans consistently map quantifiers along a cross-linguistic scale, whereas GPT-4o lacks a stable ordering and overuses high-magnitude quantifiers; o4-mini is more human-like in ordering but still diverges in ranges and prototypicality, and its behavior varies across languages. The embedding analyses reveal non-graded mappings and language-resource effects, and the numerosity task points to multiple causes behind the biases in model perception. Together, the results suggest that current MLLMs do not yet achieve human-like semantic-pragmatic quantification across languages.

Abstract

Quantification has proven to be a particularly difficult linguistic phenomenon for (Multimodal) Large Language Models (MLLMs). However, given that quantification interfaces with the logical, pragmatic, and numerical domains, the exact reasons for the poor performance are still unclear. This paper looks at three key features of human quantification shared cross-linguistically that have so far remained unexplored in the (M)LLM literature: the ordering of quantifiers into scales, the ranges of use and prototypicality, and the biases inherent in the human approximate number system. The aim is to determine how these features are encoded in the models' architecture, how they may differ from humans, and whether the results are affected by the type of model and language under investigation. We find that there are clear differences between humans and MLLMs with respect to these features across various tasks that tap into the representation of quantification in vivo vs. in silico. This work thus paves the way for addressing the nature of MLLMs as semantic and pragmatic agents, while the cross-linguistic lens can elucidate whether their abilities are robust and stable across different languages.

Paper Structure

This paper contains 12 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: The experiment's visual stimuli.
  • Figure 2: Quantifier chosen per reported proportion of black squares. Panel A compares humans with GPT-4o, and Panel B humans with o4-mini.
  • Figure 3: Cosine similarity of quantificational and proportional sentences from the text-embedding-3-large model.
  • Figure 4: Real vs. reported proportion of black squares in the stimuli. The size of each dot indicates the number of responses, the red (dashed) line indicates the theoretically correct response, and the blue (solid) line indicates the model fit.
  • Figure 5: Cross-linguistic benchmark.
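
The cosine-similarity measure used in the Embeddings task (Figure 3) can be illustrated with a minimal sketch. The vectors below are hypothetical three-dimensional stand-ins, not real embeddings, and the example sentences are assumptions chosen for illustration; the paper's actual sentences, embedding calls, and analysis pipeline are not shown in this summary.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for sentence embeddings (assumption:
# real embeddings would come from an embedding model and have far more dimensions).
toy_embeddings = {
    "Few of the squares are black": [0.9, 0.1, 0.0],
    "20% of the squares are black": [0.8, 0.2, 0.1],
    "80% of the squares are black": [0.1, 0.3, 0.9],
}

quantifier_vec = toy_embeddings["Few of the squares are black"]
sim_low = cosine_similarity(quantifier_vec, toy_embeddings["20% of the squares are black"])
sim_high = cosine_similarity(quantifier_vec, toy_embeddings["80% of the squares are black"])

# A graded, human-like mapping would show "few" closer to low proportions
# than to high ones; the toy vectors are constructed to show that pattern.
print(f"few vs 20%: {sim_low:.3f}")
print(f"few vs 80%: {sim_high:.3f}")
```

Under a graded mapping, similarity between a quantifier sentence and a proportion sentence would fall off smoothly as the proportion moves away from the quantifier's typical range. The toy vectors are built to show that pattern and say nothing about what the models actually do.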