Towards Human Cognition: Visual Context Guides Syntactic Priming in Fusion-Encoded Models

Bushi Xiao; Michael Bennie; Jayetri Bardhan; Daisy Zhe Wang

TL;DR

This work investigates whether multimodal large language models exhibit human-like structural priming. It introduces PRISMATIC, the first multimodal structural priming dataset, and the Syntactic Preservation Index (SPI), a reference-free metric for sentence-level priming. It analyzes two architectural paradigms, dual encoding and fusion encoding, and finds that fusion-encoded models show a stronger association between visual similarity and syntactic preservation, which is closer to human psycholinguistic patterns. Across automatic and human-validated data, PRISMATIC provides a standardized benchmark, while SPI enables direct evaluation of generated sentences without fixed targets. The findings suggest that unified multimodal representations, as in fusion encoding, better support cross-modal syntactic influences, with implications for cognitively aligned AI systems and future multimodal syntax research.

Abstract

Structural priming is a cognitive phenomenon in which exposure to a particular syntactic structure increases the likelihood of producing the same structure in subsequent utterances. While humans consistently demonstrate structural priming effects across various linguistic contexts, it remains unclear whether multimodal large language models (MLLMs) exhibit similar syntactic preservation behaviors. We introduce PRISMATIC, the first multimodal structural priming dataset, which advances computational linguistics by providing a standardized benchmark for investigating syntax-vision interactions. We propose the Syntactic Preservation Index (SPI), a novel reference-free evaluation metric designed specifically to assess structural priming effects at the sentence level. Using this metric, we construct and test models with two different multimodal encoding architectures to investigate their structural preservation capabilities. Our experimental results demonstrate that models with both encoding methods show comparable syntactic priming effects. However, only fusion-encoded models exhibit robust positive correlations between priming effects and visual similarity, suggesting a cognitive process more aligned with human psycholinguistic patterns. This work provides new insights into evaluating and understanding how syntactic information is processed in multimodal language models.
