The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models

Xinyi Chen; Raquel Fernández; Sandro Pezzelle

TL;DR

The paper introduces the BLA Benchmark to probe whether pre-trained multimodal models genuinely understand basic image-text interactions, focusing on three simple linguistic constructions: Active-Passive voice, Coordination, and Relative Clauses. It evaluates five vision-language models (the discriminative CLIP, ViLBERT, and LXMERT, and the generative BLIP2 and OpenFlamingo) both zero-shot and after BLA-specific learning. Zero-shot performance is weak and falls well short of human performance. Most models benefit only marginally from fine-tuning or prompting with construction-specific samples; the generative BLIP2 is the exception and shows promising gains, especially with in-context learning. The authors therefore propose BLA not only as an evaluation benchmark but also as a resource for improving models' basic language abilities, and they release the benchmark and code.

Abstract

Despite the impressive performance achieved by pre-trained language-and-vision models in downstream tasks, it remains an open question whether this reflects a proper understanding of image-text interaction. In this work, we explore to what extent they handle basic linguistic constructions -- active-passive voice, coordination, and relative clauses -- that even preschool children can typically master. We present BLA, a novel, automatically constructed benchmark to evaluate multimodal models on these Basic Language Abilities. We show that different types of Transformer-based systems, such as CLIP, ViLBERT, and BLIP2, generally struggle with BLA in a zero-shot setting, in line with previous findings. Our experiments, in particular, show that most of the tested models only marginally benefit when fine-tuned or prompted with construction-specific samples. Yet, the generative BLIP2 shows promising trends, especially in an in-context learning setting. This opens the door to using BLA not only as an evaluation benchmark but also to improve models' basic language abilities.

Paper Structure (49 sections, 5 figures, 6 tables)

Introduction
Related Work
Basic Language Comprehension Abilities
Language Abilities of Pre-Trained Multimodal Models
The BLA Benchmark
Linguistic Constructions
Active-Passive voice (AP)
Coordination (CO)
Relative Clause (RC)
Benchmark Format
Dataset Construction
I. Selection of entities and predicates
II. Minimum object size
III. Sentence construction
IV. Grammar acceptability
...and 34 more sections

Figures (5)

Figure 3: Comparison between model accuracies in zero-shot (lighter-color bars) and BLA-specific learning (darker bars). Results are obtained in the SD setting.
Figure 4: An example image from the Visual Genome dataset with its region descriptions, QA pairs, objects, attributes, and relationships, as canonicalized by Krishna et al. (2017). An example annotation for a relationship is <predicate: pulls, subject: horse, object: carriage, ...>, for an attribute is <carriage, green>, and for an object is <object_id, width: ..., length: ..., x:..., y:...>.
Figure 5: Example of the Appen question interface. The golden label of this question is "No".
Figure : Active-Passive voice. T: the woman feeds the man. T: the man is fed by the woman. F: the man feeds the woman. F: the woman is fed by the man.
Figure : Active-Passive voice. T: the gentleman kisses the woman. T: the woman is kissed by the gentleman. F: the woman kisses the gentleman. F: the gentleman is kissed by the woman.
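
The Active-Passive examples in the captions above follow a regular pattern: two true sentences (active and passive) and two false ones in which the agent and patient are swapped. The sketch below reproduces that pattern from a single relationship. It is only an illustration under stated assumptions, not the authors' construction pipeline: the names `Caption` and `active_passive_set` are hypothetical, and the past participle is passed in by hand rather than derived automatically.

```python
from typing import List, NamedTuple


class Caption(NamedTuple):
    """A candidate sentence paired with its truth value for the image."""
    text: str
    is_true: bool


def active_passive_set(agent: str, verb_3sg: str, participle: str,
                       patient: str) -> List[Caption]:
    """Build two true and two false sentences from one relationship.

    Hypothetical helper: the verb forms are supplied by the caller
    (assumption), so no morphological analysis is attempted here.
    """
    return [
        # True: the agent acts on the patient (active, then passive).
        Caption(f"the {agent} {verb_3sg} the {patient}.", True),
        Caption(f"the {patient} is {participle} by the {agent}.", True),
        # False: the same templates with agent and patient swapped.
        Caption(f"the {patient} {verb_3sg} the {agent}.", False),
        Caption(f"the {agent} is {participle} by the {patient}.", False),
    ]


captions = active_passive_set("woman", "feeds", "fed", "man")
for caption in captions:
    label = "T" if caption.is_true else "F"
    print(f"{label}: {caption.text}")
```

Called with the relationship from the first caption, the helper yields the same four sentences listed there, which makes the symmetry between true and false items explicit: each false sentence differs from a true one only in who does what to whom.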
