Probing Image-Language Transformers for Verb Understanding
Lisa Anne Hendricks, Aida Nematzadeh
TL;DR
The paper introduces SVO-Probes, a verb-focused zero-shot benchmark for probing verb grounding in pretrained image-language transformers. The authors select 421 verbs that are either visual or common in Conceptual Captions, drawing on Conceptual Captions and imSitu, and assemble over 11k SVO (subject-verb-object) triplets with roughly 48k image-sentence pairs and controlled negatives. They use these pairs to test whether models can tell matching from non-matching image-sentence pairs when the difference is subtle and centered on the verb. Across multiple architectures, verbs are consistently harder than subjects and objects, and noisy pretraining data can degrade fine-grained distinctions even when overall task performance is strong. The work exposes a clear gap in verb understanding in current multimodal representations and argues for less noisy pretraining data and verb-centric evaluation to guide future improvements.
Abstract
Multimodal image-language transformers have achieved impressive results on a variety of tasks that rely on fine-tuning (e.g., visual question answering and image retrieval). We are interested in shedding light on the quality of their pretrained representations -- in particular, whether these models can distinguish different types of verbs or whether they rely solely on nouns in a given sentence. To do so, we collect a dataset of image-sentence pairs (in English) consisting of 421 verbs that are either visual or commonly found in the pretraining data (i.e., the Conceptual Captions dataset). We use this dataset to evaluate pretrained image-language transformers and find that they fail more in situations that require verb understanding compared to other parts of speech. We also investigate which categories of verbs are particularly challenging.
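The probing setup summarized above can be illustrated with a minimal sketch. It assumes each probe example pairs a sentence with an image, is labeled as a match or a controlled negative, and records which slot of the triplet (subject, verb, or object) the negative changes. The names `ProbeExample`, `accuracy_by_slot`, and `noun_only_predictor`, and the toy data, are hypothetical and are not taken from the paper or its released code. The toy predictor deliberately looks only at nouns, to show how such a shortcut would surface as low accuracy on verb negatives.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class ProbeExample(NamedTuple):
    """One image-sentence pair (hypothetical schema, for illustration only)."""
    sentence: str     # here a toy "subject verb object" string
    image_id: str
    is_match: bool    # True for a positive pair, False for a controlled negative
    differs_in: str   # "subject", "verb", "object", or "none" for positives


def accuracy_by_slot(
    examples: Iterable[ProbeExample],
    predict_match: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Zero-shot match accuracy, grouped by the slot a negative changes."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        total[ex.differs_in] += 1
        if predict_match(ex.sentence, ex.image_id) == ex.is_match:
            correct[ex.differs_in] += 1
    return {slot: correct[slot] / total[slot] for slot in total}


# Toy stand-in for image content: each image is described by one triplet.
IMAGES = {
    "img1": ("girl", "sit", "grass"),
    "img2": ("girl", "run", "grass"),
    "img3": ("dog", "sit", "grass"),
}


def noun_only_predictor(sentence: str, image_id: str) -> bool:
    """A deliberately flawed predictor that ignores the verb entirely."""
    subj, _verb, obj = sentence.split()
    img_subj, _img_verb, img_obj = IMAGES[image_id]
    return subj == img_subj and obj == img_obj


examples = [
    ProbeExample("girl sit grass", "img1", True, "none"),
    ProbeExample("girl sit grass", "img2", False, "verb"),
    ProbeExample("girl sit grass", "img3", False, "subject"),
]

result = accuracy_by_slot(examples, noun_only_predictor)
print(result)
```

In a real evaluation, `predict_match` would wrap a pretrained model's image-sentence match score with a decision threshold. Grouping accuracy by the changed slot is what allows verb errors to be separated from subject and object errors.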
