$\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles
Trishanu Das, Abhilash Nandy, Khush Bajaj, Deepiha S
TL;DR
This work targets the challenge of understanding Rebus Puzzles with Vision-Language Models. It introduces |BUS|, a large, diverse benchmark containing 1,333 English Rebus Puzzles across 18 categories, enriched with metadata and 611 ControlNet-augmented samples that raise difficulty. The authors propose RebusDescProgICE, a model-agnostic in-context reasoning framework that pairs unstructured image descriptions with structured, code-based reasoning and a novel, reasoning-based in-context example selection strategy. Compared to Chain-of-Thought reasoning, RebusDescProgICE improves substring accuracy and Word-Level F1 by 2.1–4.1% with closed-source models (e.g., the GPT-4o family) and by 20–30% with open-source models (e.g., Phi-3.5-Vision, Pixtral, Qwen2-VL-7B), highlighting the importance of combining descriptive grounding with code-based reasoning for challenging multimodal reasoning tasks. Overall, |BUS| provides a challenging benchmark for evaluating and advancing the reasoning capabilities of Vision-Language Models on creative wordplay puzzles.
Abstract
Rebus Puzzles use pictures, symbols, and letters to represent words or phrases creatively. Understanding them requires a variety of skills, including image recognition, cognitive skills, commonsense reasoning, multi-step reasoning, and image-based wordplay, which makes the task challenging even for current Vision-Language Models. In this paper, we present $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a large and diverse benchmark of 1,333 English Rebus Puzzles covering different artistic styles and levels of difficulty, spread across 18 categories such as food, idioms, sports, finance, and entertainment. We also propose $RebusDescProgICE$, a model-agnostic framework that combines an unstructured description with code-based, structured reasoning and uses better, reasoning-based in-context example selection. Compared to Chain-of-Thought reasoning, it improves the performance of Vision-Language Models on $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$ by 2.1–4.1% with closed-source models and by 20–30% with open-source models.
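The TL;DR names two evaluation metrics, substring accuracy and Word-Level F1, without defining them. The sketch below shows one common way such metrics are computed for short text answers. It is an illustration under stated assumptions, not the authors' evaluation code: the normalization step (lowercasing, keeping only alphanumeric tokens) and the function names `normalize`, `substring_match`, and `word_f1` are hypothetical choices made here.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase and split into alphanumeric tokens (assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def substring_match(prediction, answer):
    """Return True if the normalized answer occurs inside the normalized prediction.

    This is a plain character-level containment check on the space-joined tokens.
    """
    pred = " ".join(normalize(prediction))
    gold = " ".join(normalize(answer))
    return bool(gold) and gold in pred


def word_f1(prediction, answer):
    """Harmonic mean of word-level precision and recall (bag-of-words overlap)."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(answer)
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it
    # appears in both the prediction and the reference answer.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a verbose model response such as "The answer is: Piece of cake!" counts as a substring match for the answer "piece of cake", while Word-Level F1 gives partial credit to a near miss such as "piece of pie". Averaging either function over all puzzles gives a benchmark-level score.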
