
Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems

Szymon Pawlonka, Mikołaj Małkiński, Jacek Mańdziuk

TL;DR

VLMs can recognize coarse-grained visual concepts but consistently struggle to discern fine-grained ones, which highlights limitations in their reasoning capabilities.

Abstract

Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts from just a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, but the concepts they represent are identifiable from high-level image features, which reduces task complexity. In contrast, the recently released Bongard-RWR dataset aimed to represent the abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just $60$ instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of $5\,400$ instances that represent original BP abstract concepts using real-world-like images generated via a vision language model (VLM) pipeline. Building on Bongard-RWR, we employ Pixtral-12B to describe manually curated images and generate new descriptions aligned with the underlying concepts, use Flux.1-dev to synthesize images from these descriptions, and manually verify that the generated images faithfully reflect the intended concepts. We evaluate state-of-the-art VLMs across diverse BP formulations, including binary and multiclass classification, as well as textual answer generation. Our findings reveal that while VLMs can recognize coarse-grained visual concepts, they consistently struggle with discerning fine-grained concepts, highlighting limitations in their reasoning capabilities.
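The generation loop outlined in the abstract (describe, augment, synthesize, manually verify) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the four callables `describe`, `augment`, `synthesize`, and `accept` are hypothetical stand-ins for the image-to-text model, the text-to-text model, the text-to-image model, and the human reviewer, and the toy lambdas at the bottom only exercise the control flow. For brevity the sketch tracks a single caption per image.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    source_image: str  # identifier of the manually curated image
    description: str   # augmented description aligned with the concept
    image: str         # identifier of the generated image


def generate_candidates(
    images: List[str],
    concept: str,
    describe: Callable[[str, str], str],            # hypothetical I2T step
    augment: Callable[[str, str, int], List[str]],  # hypothetical T2T step
    synthesize: Callable[[str], str],               # hypothetical T2I step
    accept: Callable[[str, str], bool],             # hypothetical human check
    n_variants: int,
) -> List[Candidate]:
    """Return the generated images that pass the verification step."""
    kept: List[Candidate] = []
    for img in images:
        caption = describe(img, concept)
        for desc in augment(caption, concept, n_variants):
            generated = synthesize(desc)
            if accept(generated, concept):
                kept.append(Candidate(img, desc, generated))
    return kept


# Toy stand-ins (assumptions for illustration only; no real model is called).
demo = generate_candidates(
    images=["a", "b"],
    concept="arrows",
    describe=lambda img, c: f"{img} showing {c}",
    augment=lambda cap, c, n: [f"{cap} v{j}" for j in range(n)],
    synthesize=lambda d: f"img({d})",
    accept=lambda g, c: not g.endswith("v0)"),  # reject the first variant
    n_variants=3,
)
print(len(demo))
```

Passing the models in as callables keeps the sketch independent of any particular model API; in practice each step would wrap whatever inference interface is available.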

Paper Structure

This paper contains 49 sections, 21 figures, 16 tables, 2 algorithms.

Figures (21)

  • Figure 1: Bongard Problems. All matrices present the same abstract concept: Left side: Arrows pointing in different directions. Right side: Arrows pointing in the same direction. (a) A manually-designed synthetic BP bongard1970pattern, foundalis2021index. (b) A manually-designed real-world representation from Bongard-RWR malkinski2024reasoning. (c) An automatically generated real-world representation from Bongard-RWR+.
  • Figure 2: Related BP datasets. (a) A synthetic BP from Bongard-LOGO nie2020bongard; Left: Shapes are the same. Right: Shapes are different. (b) A real-world BP from Bongard-HOI jiang2022bongard; Left: A person driving a car. Right: Not a person driving a car. (c) A real-world BP from Bongard-OpenWorld wu2024bongardopenworld; Left: The top of a snow-covered mountain. Right: Not the top of a snow-covered mountain. Unlike Bongard-LOGO, which involves synthetic images unfamiliar to VLMs, or Bongard-HOI and Bongard-OpenWorld, which focus on coarse-grained concepts, Bongard-RWR+ is designed around abstract concepts expressed through realistic images that require fine-grained visual reasoning.
  • Figure 3: Generative pipeline. Starting from a Bongard problem $\mathcal{BP}$ with concept $(C_L, C_R)$, the pipeline: (1) describes each image using an I2T model to produce paired positive/negative captions $\mathcal{L}^+_i$ and $\mathcal{L}^-_i$; (2) augments each positive caption with a T2T model into $N$ diverse descriptions $\{\mathcal{L}^+_{i,j}\}_{j=1}^N$ that preserve the underlying concept; (3) generates candidate images for each new description using a T2I model; and (4) involves a human judge to review and filter the generated images. For readability, the figure illustrates the processing flow for the first image from the left matrix side.
  • Figure 4: BP formulations. We define six tasks of variable complexity: (a; I1S) assign a single test image to the left or right side; (b; I2S) assign a pair of test images to the respective sides; (c; D1S / d; D2S) use descriptions from an I2T model and classify with a T2T model; (e; CS) select the correct concept index $\widehat{k}$ such that $C_{\widehat{k}} = C^*$; (f; CG) generate a natural language description of concept $\widehat{C}$.
  • Figure 5: Concept Selection. Accuracy in the CS task on Bongard-RWR+ for $K \in \{2, 4, 8, 16\}$.
  • ...and 16 more figures
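
The Concept Selection (CS) task from Figures 4 and 5 asks a model to pick the index $\widehat{k}$ of the correct concept among $K$ candidates. The sketch below shows one way such an instance could be assembled and scored. It rests on assumptions not stated in the captions: distractor concepts are drawn uniformly at random from a pool, concepts are represented as plain strings, and the names `build_cs_instance` and `cs_accuracy` are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

CSInstance = Tuple[List[str], int]  # (candidate concepts, index of the true one)


def build_cs_instance(
    true_concept: str, pool: Sequence[str], k: int, rng: random.Random
) -> CSInstance:
    """Mix the true concept with k - 1 sampled distractors and shuffle."""
    others = [c for c in pool if c != true_concept]
    options = rng.sample(others, k - 1) + [true_concept]
    rng.shuffle(options)
    return options, options.index(true_concept)


def cs_accuracy(
    instances: Sequence[CSInstance], predict: Callable[[List[str]], int]
) -> float:
    """Fraction of instances where the predicted index matches the answer."""
    correct = sum(predict(options) == answer for options, answer in instances)
    return correct / len(instances)


rng = random.Random(0)
pool = [f"concept {i}" for i in range(20)]  # placeholder concept descriptions
instances = [build_cs_instance("concept 3", pool, k, rng) for k in (2, 4, 8, 16)]

# An oracle that always finds the true concept scores 1.0 by construction.
oracle_accuracy = cs_accuracy(instances, lambda opts: opts.index("concept 3"))
print(oracle_accuracy)
```

Under uniform guessing the expected accuracy is $1/K$, so results for larger $K$ should be read against a lower chance level.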