MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models
Anh Thai, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg
TL;DR
MEBench evaluates mutual exclusivity (ME) bias in vision-language models (VLMs) by embedding ME reasoning in spatially contextual scenes built with a synthetic data generator, and it tests seven state-of-the-art baselines. The benchmark formalizes a three-part task: localizing known objects, assigning a novel label under ME, and using spatial cues to resolve ambiguity when multiple novel objects appear. The task and its variants are supported by a configurable data pipeline. Results show that current VLMs often exhibit weak ME bias but can leverage spatial descriptions to improve disambiguation; some models, such as CogVLM, excel at localization yet lag in spatial reasoning. This work provides a pathway toward more human-like zero-shot generalization in multimodal systems and offers open data and tooling to facilitate reproducibility and further research.
Abstract
This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. We assess the performance of state-of-the-art vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes.
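To make the kind of reasoning under evaluation concrete, the following is a minimal illustrative sketch of ME-based label assignment with an optional spatial cue. It is not the benchmark's implementation: the `SceneObject` type, the `me_select` function, the 1-D `x` position, and the `("left_of", label)` cue format are all hypothetical names and simplifications introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SceneObject:
    """Hypothetical scene entry; not a MEBench data structure."""
    name: str                    # instance identifier
    known_label: Optional[str]   # None means the object has no known name (novel)
    x: float                     # horizontal position; larger means further right


def me_select(objects: List[SceneObject],
              cue: Optional[Tuple[str, str]] = None) -> Optional[SceneObject]:
    """Pick the referent of a novel word, or None if it stays ambiguous.

    ME step: objects that already have a known label are excluded.
    Spatial step (assumed cue format): ("left_of" | "right_of", known_label)
    keeps only novel objects on that side of the named known object.
    """
    candidates = [o for o in objects if o.known_label is None]

    if cue is not None:
        relation, anchor_label = cue
        anchor = next(o for o in objects if o.known_label == anchor_label)
        if relation == "left_of":
            candidates = [o for o in candidates if o.x < anchor.x]
        elif relation == "right_of":
            candidates = [o for o in candidates if o.x > anchor.x]
        else:
            raise ValueError(f"unsupported relation: {relation}")

    # A unique remaining candidate is the answer; otherwise the query is unresolved.
    return candidates[0] if len(candidates) == 1 else None


scene = [
    SceneObject("cup_1", "cup", 0.5),
    SceneObject("novel_a", None, 0.2),
    SceneObject("novel_b", None, 0.8),
]
ambiguous = me_select(scene)                    # two novel objects, no cue
resolved = me_select(scene, ("left_of", "cup")) # cue narrows to one object
```

With two novel objects and no cue, ME alone cannot single out a referent; adding the spatial description narrows the candidates to one, which mirrors the disambiguation setting described above.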
