A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data

Joseph Bingham

TL;DR

A computational framework designed to model core aspects of human referential interpretation by integrating linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery achieves robust referential grounding and offers insights into models of grounded communication, perceptual inference, and cross-modal concept formation.

Abstract

Establishing stable mappings between natural language expressions and visual percepts is a foundational problem for both cognitive science and artificial intelligence. Humans routinely ground linguistic reference in noisy, ambiguous perceptual contexts, yet the mechanisms supporting such cross-modal alignment remain poorly understood. In this work, we introduce a computational framework designed to model core aspects of human referential interpretation by integrating linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery. The system approximates human perceptual categorization by combining scale-invariant feature transform (SIFT) alignment with the Universal Quality Index (UQI) to quantify similarity in a cognitively plausible feature space, while a set of linguistic preprocessing and query-transformation operations captures pragmatic variability in referring expressions. We evaluate the model on the Stanford Repeated Reference Game corpus (15,000 utterances paired with tangram stimuli), a paradigm explicitly developed to probe human-level perceptual ambiguity and coordination. Our framework achieves robust referential grounding: it requires 65% fewer utterances than human interlocutors to reach stable mappings and correctly identifies target objects from single referring expressions 41.66% of the time (versus 20% for humans). These results suggest that relatively simple perceptual-linguistic alignment mechanisms can yield human-competitive behavior on a classic cognitive benchmark, and they offer insights into models of grounded communication, perceptual inference, and cross-modal concept formation. Code is available at https://anonymous.4open.science/r/metasequoia-9D13/README.md .
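
For readers unfamiliar with the Universal Quality Index named in the abstract, the following is a minimal sketch of its standard global form (Wang and Bovik, 2002), Q = 4·σ_xy·x̄·ȳ / ((σ_x² + σ_y²)(x̄² + ȳ²)). It is not taken from the linked repository: the function name `uqi` and the flat-list input format are assumptions made for illustration, and the sketch omits both the sliding-window averaging often used with UQI and the SIFT alignment step that the framework applies before comparison.

```python
from statistics import fmean


def uqi(x, y):
    """Global Universal Quality Index for two equal-length pixel sequences.

    Hypothetical helper for illustration only; inputs are flat sequences of
    grayscale intensities from two already-aligned images.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    # Unbiased sample variances and covariance.
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    denom = (var_x + var_y) * (mean_x ** 2 + mean_y ** 2)
    if denom == 0:
        # Happens when both inputs are constant, or both means are zero.
        raise ValueError("UQI is undefined for this pair of inputs")
    return 4 * cov_xy * mean_x * mean_y / denom


if __name__ == "__main__":
    target = [10, 20, 30, 40]
    candidate = [12, 19, 33, 38]
    print(uqi(target, candidate))
```

The index lies in [-1, 1] and equals 1 only when the two inputs are identical, so a higher value would indicate a closer match between a scraped image and the target under this measure.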

Paper Structure (21 sections, 5 equations, 6 figures, 2 tables)

Introduction
Motivation, Background, and Related Work
Common Ground
The Repeated Reference Problem
Dynamic Semantics
Possible Worlds Semantics for Epistemic Modals
Perceptual Alignment
Methods
Query Construction for Web-Scraping
Image Matching
Image Alignment
Image Comparison
Formalizing Common Ground Establishment
Experimental Methods and Results
Transformations on $\varphi$
...and 6 more sections

Figures (6)

Figure 1: An overview of the repeated reference game and our framework for lexical entrainment and common ground establishment. Our paper addresses the right-hand side of this figure, where the human serves as director and the AI as matcher. The sets $\Gamma, \Xi, \Omega$ maintain the current state of common ground, with $\Gamma$ containing finalized conceptual pacts, $\Xi$ the set of conceptual pacts under negotiation, and $\Omega$ containing any pacts which were rejected.
Figure 2: The symmetric simplicial sets of common ground that exist between human co-performers $A$ and $B$, and machine co-performer $C$. Each pair of co-performers has its own common ground, representing shared understanding about themselves, their joint activity, and the environment, represented by the nodes $AB, AC, BC$ at the learned Wasserstein barycenters of mutual alignment. Common ground $ABC$ is shared by all three.
Figure 3: An example of the repeated reference problem, with the director on the right and the matcher on the left. The director issues an utterance, $\varphi$, indicating what they perceive the selected tangram stimulus to depict. The matcher can either guess which tangram they believe the director is referring to, pose a clarifying question, or wait for the director to provide more information. The illustrated example uses text from the open corpus.
Figure 4: Changes in the accuracy of our bindings resulting from varying the number of images scraped by the MCP matcher. These accuracy numbers represent achieving lexical entrainment on the referent in a single utterance, as a comparative measure.
Figure 5: An example of the distances from the target to the 5 closest scraped photos. The query text for this example is "tangram figure sitting and looking". All queries were manually checked to ensure that the scraped images were distinct from the tangram figure.
...and 1 more figures
