The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models
Xinyi Chen, Raquel Fernández, Sandro Pezzelle
TL;DR
The paper introduces the BLA Benchmark to probe whether multimodal models genuinely understand basic image-text interactions, focusing on three simple linguistic constructions: active-passive voice, coordination, and relative clauses. It evaluates five vision-language models (the discriminative CLIP, ViLBERT, and LXMERT, and the generative BLIP2 and OpenFlamingo) in zero-shot settings and under BLA-specific learning, revealing widespread gaps compared to humans and a notable edge for the generative BLIP2, especially with in-context learning. Zero-shot performance is weak, and most models benefit only marginally from fine-tuning or prompting with construction-specific samples; BLIP2 is the exception, which suggests BLA can serve both to evaluate and to improve grounded language abilities. The authors also release the benchmark and code to support work on closing this gap in multimodal models.
Abstract
Despite the impressive performance achieved by pre-trained language-and-vision models in downstream tasks, it remains an open question whether this reflects a proper understanding of image-text interaction. In this work, we explore to what extent they handle basic linguistic constructions -- active-passive voice, coordination, and relative clauses -- that even preschool children can typically master. We present BLA, a novel, automatically constructed benchmark to evaluate multimodal models on these Basic Language Abilities. We show that different types of Transformer-based systems, such as CLIP, ViLBERT, and BLIP2, generally struggle with BLA in a zero-shot setting, in line with previous findings. Our experiments, in particular, show that most of the tested models only marginally benefit when fine-tuned or prompted with construction-specific samples. Yet, the generative BLIP2 shows promising trends, especially in an in-context learning setting. This opens the door to using BLA not only as an evaluation benchmark but also to improve models' basic language abilities.
