MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

Anh Thai, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg

TL;DR

MEBench tackles mutual exclusivity (ME) bias in vision-language models (VLMs) by embedding ME reasoning in spatially contextual scenes and evaluating seven state-of-the-art baselines on data from a synthetic data generator. The benchmark formalizes a three-part task: localizing known objects, assigning a novel label under ME, and using spatial cues to resolve ambiguity when multiple novel objects appear. The task is enabled by a configurable data pipeline and variants. Results show that current VLMs often exhibit weak ME bias but can leverage spatial descriptions to improve disambiguation; some models, such as CogVLM, excel at localization yet lag in spatial reasoning. The work offers a path toward more human-like zero-shot generalization in multimodal systems and provides open data and tooling to support reproducibility and further research.

Abstract

This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. We assess the performance of state-of-the-art vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes.

Paper Structure

This paper contains 25 sections, 3 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Mutual Exclusivity Bias Evaluation Settings. (a) Traditional ME bias evaluation in developmental psychology and early computational studies [gandhi2020mutual], (b) MEBench setup for classic ME bias testing, and (c) MEBench setup for evaluating ME bias in conjunction with spatial reasoning.
  • Figure 2: Example of Rendered Data for the MEBench Benchmark. We systematically generate diverse object configurations within varied room backgrounds, ensuring photorealistic renderings that capture realistic spatial arrangements and lighting conditions.
  • Figure 3: Novel Objects in MEBench. To prevent data leakage during evaluation, we constructed a database of novel objects using procedural generation in Blender [blender] with geometry nodes from GeoShapeV2 [geoshapes] and Thingi10K [zhou2016thingi10k]. These objects serve as unknown instances, ensuring that models are tested on truly unseen categories.
  • Figure 4: Object Detection Performance of VLMs on Known Objects in the 1K-0U (1 known and 0 unknown objects), 1K-1U (1 known and 1 unknown object), and 2K-1U (2 known and 1 unknown object) settings.
  • Figure 5: Mutual Exclusivity (ME) Analysis in the (Left) 1K-1U and (Middle) 2K-1U settings. These settings contain one novel object in the scene. The response types are categorized as follows: $N\rightarrow N$ denotes correctly assigning the novel label to the novel object, $N\rightarrow K$ represents misassigning the novel label to a known object, and $N\rightarrow Bg$ indicates misassigning the novel label to a background distractor or failing to detect high-quality object bounding boxes. Additionally, No Prediction indicates cases where the model fails to produce a bounding box for the referred object. (Right) ME Scores of 1K-1U and 2K-1U settings. Higher scores indicate stronger ME bias.
  • ...and 3 more figures
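
The response categories named in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' evaluation code: the IoU-based matching rule, the 0.5 threshold, and the names `iou`, `classify_response`, and `response_rates` are all assumptions made for illustration, and the sketch deliberately does not compute the paper's ME Score, whose formula is not given in this section.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

# Boxes are (x_min, y_min, x_max, y_max); this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def classify_response(
    pred: Optional[Box],
    novel_box: Box,
    known_boxes: Sequence[Box],
    thresh: float = 0.5,  # hypothetical threshold, not taken from the paper
) -> str:
    """Map a model's box for the novel label to one response category."""
    if pred is None:
        return "No Prediction"  # model produced no bounding box
    if iou(pred, novel_box) >= thresh:
        return "N->N"  # novel label assigned to the novel object
    if any(iou(pred, k) >= thresh for k in known_boxes):
        return "N->K"  # novel label assigned to a known object
    return "N->Bg"  # background distractor or low-quality box


def response_rates(labels: List[str]) -> Dict[str, float]:
    """Fraction of responses falling in each category."""
    if not labels:
        return {}
    counts = Counter(labels)
    return {name: n / len(labels) for name, n in counts.items()}
```

For a 1K-1U scene, `classify_response(pred, novel_box, [known_box])` would be called once per image, and `response_rates` would then summarize the resulting labels across the evaluation set.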