The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models

Xinyi Chen, Raquel Fernández, Sandro Pezzelle

TL;DR

The paper introduces the BLA Benchmark to probe whether multimodal models genuinely understand basic image-text interactions by focusing on three simple linguistic constructions (Active-Passive, Coordination, Relative Clauses). It evaluates five vision-language models (discriminative CLIP/ViLBERT/LXMERT and generative BLIP2/OpenFlamingo) in zero-shot settings and under BLA-specific learning, revealing widespread gaps compared to humans and a notable edge for the generative BLIP2, especially with in-context learning. The authors show that zero-shot performance is weak and that most models benefit only marginally from fine-tuning or prompting with construction-specific samples; BLIP2 is the main exception, which suggests BLA can guide both evaluation and improvement of grounded language abilities. They also release the benchmark and code, advocating a path toward closing the gap in basic language understanding in multimodal models.

Abstract

Despite the impressive performance achieved by pre-trained language-and-vision models in downstream tasks, it remains an open question whether this reflects a proper understanding of image-text interaction. In this work, we explore to what extent they handle basic linguistic constructions -- active-passive voice, coordination, and relative clauses -- that even preschool children can typically master. We present BLA, a novel, automatically constructed benchmark to evaluate multimodal models on these Basic Language Abilities. We show that different types of Transformer-based systems, such as CLIP, ViLBERT, and BLIP2, generally struggle with BLA in a zero-shot setting, in line with previous findings. Our experiments, in particular, show that most of the tested models only marginally benefit when fine-tuned or prompted with construction-specific samples. Yet, the generative BLIP2 shows promising trends, especially in an in-context learning setting. This opens the door to using BLA not only as an evaluation benchmark but also to improve models' basic language abilities.

Paper Structure (49 sections, 5 figures, 6 tables)

Figures (5)

  • Figure 3: Comparison between model accuracies in zero-shot (lighter-color bars) and BLA-specific learning (darker bars). Results are obtained in the SD setting.
  • Figure 4: One example image from the Visual Genome dataset with its region descriptions, QA, objects, attributes, and relationships, canonicalized by Krishna et al. (2017). An example annotation for a relationship is <predicate: pulls, subject: horse, object: carriage, ...>, for an attribute is <carriage, green>, and for an object is <object_id, width: ..., length: ..., x:..., y:...>.
  • Figure 5: Example of the Appen question interface. The golden label of this question is "No".
  • Figure : Active-Passive voice T: the woman feeds the man. T: the man is fed by the woman. F: the man feeds the woman. F: the woman is fed by the man.
  • Figure : Active-Passive T: the gentleman kisses the woman. T: the woman is kissed by the gentleman. F: the woman kisses the gentleman. F: the gentleman is kissed by the woman.
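
The Active-Passive examples in the figure list each pair two true and two false captions that can all be derived from a single <subject, predicate, object> relationship. The Python sketch below illustrates that pattern only; it is not the authors' construction pipeline. The `Relationship` class, the `PAST_PARTICIPLES` lookup, and all function names are hypothetical, and a real system would need a proper morphological tool instead of a hand-written table.

```python
from dataclasses import dataclass

# Hypothetical lookup from third-person singular verbs to past participles.
# This hand-written table is an assumption made for illustration only.
PAST_PARTICIPLES = {"feeds": "fed", "kisses": "kissed", "pulls": "pulled"}


@dataclass(frozen=True)
class Relationship:
    """A simplified relationship triple in the style of <subject, predicate, object>."""
    subject: str
    predicate: str  # third-person singular verb, e.g. "feeds"
    object: str


def active(agent: str, verb: str, patient: str) -> str:
    """Active-voice caption: the agent performs the action on the patient."""
    return f"the {agent} {verb} the {patient}."


def passive(patient: str, verb: str, agent: str) -> str:
    """Passive-voice caption describing the same event as active(agent, verb, patient)."""
    return f"the {patient} is {PAST_PARTICIPLES[verb]} by the {agent}."


def active_passive_captions(rel: Relationship) -> dict:
    """Return two true captions and two false captions (roles swapped)."""
    s, v, o = rel.subject, rel.predicate, rel.object
    return {
        "true": [active(s, v, o), passive(o, v, s)],
        "false": [active(o, v, s), passive(s, v, o)],
    }


if __name__ == "__main__":
    captions = active_passive_captions(Relationship("woman", "feeds", "man"))
    for label, sentences in captions.items():
        for sentence in sentences:
            print(f"{label[0].upper()}: {sentence}")
```

The false captions keep the same words as the true ones and swap only the agent and patient roles, so a model cannot separate them by vocabulary alone.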