Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models
Sri Harsha Dumpala, David Arps, Sageev Oore, Laura Kallmeyer, Hassan Sajjad
TL;DR
This paper investigates whether text encoders in vision-language models (TE-VLMs) encode syntactic structure, and how pre-training objectives, model size, and data volume influence this ability. Using DepProbe to recover Universal Dependencies trees from the representations of TE-VLMs (e.g., CLIP, FLAVA) and of unimodal and sentence-level language models, the authors compare syntactic encoding across layers and pre-training data volumes. They find that unimodal language models encode syntactic information more effectively than TE-VLMs, and that the pre-training objective strongly shapes syntactic knowledge, with contrastive-only training yielding particularly weak syntactic signals. The results motivate incorporating auxiliary objectives into VLM pre-training to improve the encoding of linguistic structure, with implications for downstream tasks that depend on syntax.
Abstract
Vision-language models (VLMs) serve as foundation models for multi-modal applications such as image captioning and text-to-image generation. Recent studies have highlighted limitations in VLM text encoders, particularly in areas like compositionality and semantic understanding, though the underlying reasons for these limitations remain unclear. In this work, we aim to address this gap by analyzing the syntactic information, one of the fundamental linguistic properties, encoded by the text encoders of VLMs. We perform a thorough analysis comparing VLMs that differ in objective function, parameter count, and training data size, both with one another and with uni-modal language models (ULMs), in their ability to encode syntactic knowledge. Our findings suggest that ULM text encoders acquire syntactic information more effectively than those in VLMs. The syntactic information learned by VLM text encoders is shaped primarily by the pre-training objective, which plays a more crucial role than other factors such as model architecture, model size, or the volume of pre-training data. Models also exhibit different layer-wise trends: CLIP's performance drops across layers, whereas in other models the middle layers are rich in syntactic knowledge.
