Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World

Rujie Wu, Xiaojian Ma, Zhenliang Zhang, Wei Wang, Qing Li, Song-Chun Zhu, Yizhou Wang

TL;DR

Bongard-OpenWorld is a real-world, open-vocabulary, few-shot visual reasoning benchmark inspired by Bongard Problems. It pairs open-ended concepts with distractors and hard negatives to probe concept induction. The benchmark builds 1.01K problems by mining open-form concepts from CC-3M and augmenting them with crowd-sourced commonsense knowledge, forming 2-way, 6-shot episodes with two query images per problem. The authors evaluate four model families (canonical few-shot learners, single-round VLM-LLM reasoning, multi-round VLM-LLM reasoning, and a neuro-symbolic approach) and find that even advanced systems lag behind human performance (best learner ~64% accuracy vs. ~91% for humans). Results show that open-world concepts, longer concepts, and commonsense knowledge significantly increase difficulty. Open-vocabulary pretraining helps, but robust multi-image reasoning and explicit concept induction remain essential for closing the gap.

Abstract

We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world few-shot reasoning for machine vision. It originates from the classical Bongard Problems (BPs): given two sets of images (positive and negative), the model needs to identify the set that query images belong to by inducing the visual concepts that are exclusively depicted by images from the positive set. Our benchmark inherits the few-shot concept induction of the original BPs while adding two novel layers of challenge: 1) open-world free-form concepts, as the visual concepts in Bongard-OpenWorld are unique compositions of terms from an open vocabulary, ranging from object categories to abstract visual attributes and commonsense factual knowledge; 2) real-world images, as opposed to the synthetic diagrams used by many counterparts. In our exploration, Bongard-OpenWorld already poses a significant challenge to current few-shot reasoning algorithms. We further investigate to what extent recently introduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can solve our task, by directly probing VLMs and by combining VLMs and LLMs in an interactive reasoning scheme. We also devise a neuro-symbolic reasoning approach that reconciles LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems. However, none of these approaches manages to close the human-machine gap: the best learner achieves 64% accuracy, while human participants easily reach 91%. We hope Bongard-OpenWorld can help us better understand the limitations of current visual intelligence and facilitate future research on visual agents with stronger few-shot visual reasoning capabilities.
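
To make the task format concrete, the following Python sketch models one problem as a 2-way, 6-shot episode with two query images and scores a solver by per-query accuracy. The names `BongardEpisode`, `accuracy`, and `always_positive` are hypothetical and not taken from the paper or its released code. The sketch assumes six support images per set and one ground-truth label per query, and it uses image identifiers in place of real images.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence, Tuple


@dataclass(frozen=True)
class BongardEpisode:
    """One problem: two support sets plus labeled query images (assumed layout)."""
    positives: Tuple[str, ...]             # image ids that depict the hidden concept
    negatives: Tuple[str, ...]             # image ids that do not fully depict it
    queries: Tuple[Tuple[str, bool], ...]  # (image id, True if it belongs with positives)

    def __post_init__(self) -> None:
        # Assumption: "2-way, 6-shot" means six support images per set.
        if len(self.positives) != 6 or len(self.negatives) != 6:
            raise ValueError("expected 6 positive and 6 negative support images")
        if len(self.queries) != 2:
            raise ValueError("expected 2 query images per problem")


# A solver sees both support sets and one query, and predicts the query's set.
Solver = Callable[[Sequence[str], Sequence[str], str], bool]


def accuracy(solver: Solver, episodes: Iterable[BongardEpisode]) -> float:
    """Fraction of query images assigned to the correct set."""
    correct = 0
    total = 0
    for episode in episodes:
        for image, label in episode.queries:
            prediction = solver(episode.positives, episode.negatives, image)
            correct += int(prediction == label)
            total += 1
    return correct / total if total else 0.0


def always_positive(positives: Sequence[str], negatives: Sequence[str], query: str) -> bool:
    """Trivial baseline that ignores the support sets entirely."""
    return True


if __name__ == "__main__":
    episode = BongardEpisode(
        positives=tuple(f"pos_{i}" for i in range(6)),
        negatives=tuple(f"neg_{i}" for i in range(6)),
        queries=(("query_a", True), ("query_b", False)),
    )
    print(accuracy(always_positive, [episode]))
```

If each problem holds one positive and one negative query, as in the demo episode, a solver that ignores the support images scores 50%, which is the natural chance level against which the reported 64% and 91% can be read.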

Paper Structure (27 sections, 16 figures, 9 tables, 2 algorithms)

Figures (16)

  • Figure 1: Task illustration of Bongard-OpenWorld. Given two sets of images $\mathcal{P}$ and $\mathcal{N}$, the model needs to identify which set the query image $I_q$ belongs to by inferring the concepts $\mathcal{C}$ that are exclusively depicted by $\mathcal{P}$. Note that the captions and the concepts $\mathcal{C}$ are not provided to the model. To further increase the difficulty of our task, we introduce distractors, which are additional contents of the positive images other than the concepts $\mathcal{C}$, and hard negatives, which ensure that the content of negative images partially overlaps with the concepts $\mathcal{C}$. These practices force the model to reason about the visual concepts by contrasting the positives and the negatives.
  • Figure 2: Statistics of Bongard-OpenWorld. Our benchmark exhibits a range of concept lengths, spanning from 2 to 5 (subfigure a), with an average length of 3.3. As shown in Table \ref{tab:concept_cat}, crowd-sourced commonsense concepts take IDs 1 to 9, with 0 indicating "anything else" (subfigure b). While some words are more frequent (see the word cloud in subfigure c), the overall frequency of words in Bongard-OpenWorld concepts follows a long-tailed distribution.
  • Figure 3: Models for Bongard-OpenWorld. We explore four families of approaches: (a) casting Bongard-OpenWorld into a standard "2-way, 6-shot" few-shot learning problem and tackling it using state-of-the-art few-shot learners with pretrained image representations; (b) combining an LLM (reasoner) and a VLM (image captioner) in a single-round fashion, where the VLM simply captions each Bongard image and sends the captions to the LLM for solving the problem; (c) extending the method in (b) to multiple rounds, where the LLM also iteratively probes the VLM for more image details, resulting in more condensed information for solving the problem; (d) a neuro-symbolic approach, where a VLM generates the initial captions, then an LLM extracts visual concepts from them. These concepts are subsequently updated through logical operations, leveraging the responses provided by the VLM, until the problem is solved. Zoom in for a better view.
  • Figure 4: GPT-4 correctly produces both binary prediction and induced visual concepts.
  • Figure 5: BLIP-2 only covers unhelpful content of $I_q$; GPT-4 makes the correct concept induction but fails on the binary prediction.
  • ...and 11 more figures
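
The captions for Figure 1 and Figure 3(d) describe inducing concepts by contrasting positives with negatives and updating them through logical operations. The Python sketch below illustrates that idea with plain set operations. It is an assumption-laden toy, not the authors' algorithm: the function names are hypothetical, the per-image concept sets are written by hand instead of being extracted by a VLM and an LLM, and the only logical operations used are intersection and subset tests.

```python
from typing import FrozenSet, Iterable, Set


def induce_concept(positive_sets: Iterable[Set[str]]) -> FrozenSet[str]:
    """Keep only the concept terms shared by every positive image.

    Distractors that appear in some positives but not all are dropped.
    """
    frozen = [frozenset(s) for s in positive_sets]
    if not frozen:
        return frozenset()
    return frozenset.intersection(*frozen)


def is_consistent(concept: FrozenSet[str], negative_sets: Iterable[Set[str]]) -> bool:
    """A candidate is valid if it is non-empty and no negative depicts all of it.

    Hard negatives may share part of the concept, so a partial overlap is allowed.
    """
    return bool(concept) and all(not concept <= set(n) for n in negative_sets)


def classify(concept: FrozenSet[str], query_set: Set[str]) -> bool:
    """A query joins the positive set only if it depicts the whole concept."""
    return concept <= set(query_set)


if __name__ == "__main__":
    # Hand-written concept terms per image (hypothetical example data).
    positives = [
        {"dog", "grass", "ball"},
        {"dog", "grass", "frisbee"},
        {"dog", "grass", "person"},
    ]
    negatives = [
        {"dog", "sofa"},    # hard negative: overlaps on "dog" only
        {"cat", "grass"},   # hard negative: overlaps on "grass" only
    ]
    concept = induce_concept(positives)
    print(sorted(concept), is_consistent(concept, negatives))
    print(classify(concept, {"dog", "grass", "tree"}))
    print(classify(concept, {"dog", "beach"}))
```

In this toy example, subtracting every term seen in a negative would discard both "dog" and "grass" because of the hard negatives, which is why the sketch tests whether a negative contains the whole candidate concept instead.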