
$\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles

Trishanu Das, Abhilash Nandy, Khush Bajaj, Deepiha S

TL;DR

This work addresses the challenge of solving Rebus Puzzles with Vision-Language Models. It introduces |BUS|, a large, diverse benchmark of 1,333 English Rebus Puzzles across 18 categories, enriched with detailed metadata and 611 ControlNet-augmented samples that raise difficulty. The authors propose RebusDescProgICE, a model-agnostic in-context reasoning framework that pairs unstructured image descriptions with structured, code-based reasoning and a reasoning-based in-context example selection strategy. Across both closed-source (e.g., GPT-4o family) and open-source (e.g., Phi-3.5-Vision, Pixtral, Qwen2-VL-7B) models, RebusDescProgICE improves over Chain-of-Thought reasoning on metrics such as substring accuracy and Word-Level F1, with gains of 2.1–4.1% for closed-source models and 20–30% for open-source models. These results highlight the value of combining descriptive grounding with code-based reasoning for multimodal reasoning tasks. Overall, |BUS| provides a demanding, model-agnostic benchmark for evaluating and advancing the reasoning capabilities of Vision-Language Models on creative visual wordplay puzzles.

Abstract

Understanding Rebus Puzzles, which use pictures, symbols, and letters to creatively represent words or phrases, requires a variety of skills such as image recognition, cognitive skills, commonsense reasoning, multi-step reasoning, and image-based wordplay, making this a challenging task even for current Vision-Language Models. In this paper, we present $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a large and diverse benchmark of $1{,}333$ English Rebus Puzzles covering different artistic styles and levels of difficulty, spread across 18 categories such as food, idioms, sports, finance, and entertainment. We also propose $\textit{RebusDescProgICE}$, a model-agnostic framework that combines an unstructured description with code-based, structured reasoning, along with improved, reasoning-based in-context example selection. Compared to Chain-of-Thought Reasoning, it improves the performance of Vision-Language Models on $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$ by $2.1$–$4.1\%$ with closed-source models and by $20$–$30\%$ with open-source models.
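
The TL;DR reports results in terms of substring accuracy and Word-Level F1 but does not define either metric here. The sketch below shows one common way such metrics are computed for short text answers. The normalization rules and the function names `normalize`, `substring_match`, and `word_level_f1` are assumptions made for illustration, not the authors' evaluation code.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation.

    This normalization is an assumption; the paper may use different rules.
    """
    tokens = (tok.strip(".,!?;:'\"") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def substring_match(prediction: str, answer: str) -> bool:
    """True if the normalized answer occurs inside the normalized prediction."""
    gold = " ".join(normalize(answer))
    if not gold:
        return False
    return gold in " ".join(normalize(prediction))


def word_level_f1(prediction: str, answer: str) -> float:
    """Harmonic mean of word-level precision and recall (bag-of-words overlap)."""
    pred, gold = normalize(prediction), normalize(answer)
    if not pred or not gold:
        return 0.0
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, substring accuracy over a dataset would be the fraction of puzzles for which `substring_match` returns `True`, and Word-Level F1 would be the mean of `word_level_f1` over all puzzles.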

Paper Structure

This paper contains 15 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Example of a Rebus Puzzle in |BUS|
  • Figure 2: Annotation Pipeline of the |BUS| Dataset
  • Figure 3: Breakdown of some important Rebus Puzzle Metadata Characteristics
  • Figure 4: Distribution of Rebus Puzzles based on their category
  • Figure 5: 2D UMAP Projection of CLIP Image Embeddings of Rebus Puzzle Images