Support-Set Context Matters for Bongard Problems

Nikhil Raghuraman, Adam W. Harley, Leonidas Guibas

TL;DR

The paper shows that Bongard problems, which require inducing an abstract concept from small positive and negative support sets, are solved far more accurately when set-level context is used. It introduces support-set standardization, a simple, parameter-free adaptation, and two Transformer-based approaches (Prototype-Mimic and SVM-Mimic) that extract rules from the supports. Together these reach new state-of-the-art accuracy on Bongard-LOGO ($75.3\%$) and Bongard-HOI ($76.4\%$) among methods with equivalent vision backbone architectures, and strong performance on the original Bongard problem set ($60.8\%$). The findings indicate that context across multiple supports is a critical signal for visual abstract reasoning, and that relatively lightweight techniques can substantially boost performance without moving to larger backbones.

Abstract

Current machine learning methods struggle to solve Bongard problems, which are a type of IQ test that requires deriving an abstract "concept" from a set of positive and negative "support" images, and then classifying whether or not a new query image depicts the key concept. On Bongard-HOI, a benchmark for natural-image Bongard problems, most existing methods have reached at best 69% accuracy (where chance is 50%). Low accuracy is often attributed to neural nets' lack of ability to find human-like symbolic rules. In this work, we point out that many existing methods are forfeiting accuracy due to a much simpler problem: they do not adapt image features given information contained in the support set as a whole, and rely instead on information extracted from individual supports. This is a critical issue, because the "key concept" in a typical Bongard problem can often only be distinguished using multiple positives and multiple negatives. We explore simple methods to incorporate this context and show substantial gains over prior works, leading to new state-of-the-art accuracy on Bongard-LOGO (75.3%) and Bongard-HOI (76.4%) compared to methods with equivalent vision backbone architectures and strong performance on the original Bongard problem set (60.8%).
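To make the idea of adapting features "given information contained in the support set as a whole" concrete, here is a minimal sketch of one plausible reading of the support-set standardization named in the TL;DR. It assumes the statistics are a per-dimension mean and standard deviation computed over all supports of a problem; this summary does not give the paper's exact formula. The nearest-prototype classifier, the function names, and the toy 2-D "embeddings" are illustrative assumptions, not the authors' implementation.

```python
import math

def column_stats(vectors):
    """Per-dimension mean and population standard deviation."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    std = [math.sqrt(sum((v[d] - mean[d]) ** 2 for v in vectors) / n)
           for d in range(dim)]
    return mean, std

def standardize(vectors, mean, std, eps=1e-6):
    """Shift and scale each dimension with the given statistics."""
    return [[(x - m) / (s + eps) for x, m, s in zip(v, mean, std)]
            for v in vectors]

def centroid(vectors):
    """Mean vector of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def classify_queries(pos, neg, queries):
    """Standardize with support-set statistics, then pick the nearer prototype.

    Assumption: statistics come from positives and negatives together,
    and the same statistics are applied to the queries.
    """
    mean, std = column_stats(pos + neg)
    pos_n = standardize(pos, mean, std)
    neg_n = standardize(neg, mean, std)
    queries_n = standardize(queries, mean, std)
    proto_pos, proto_neg = centroid(pos_n), centroid(neg_n)
    return [sq_dist(q, proto_pos) < sq_dist(q, proto_neg) for q in queries_n]

# Toy embeddings (hypothetical): dimension 0 is a shared nuisance offset,
# dimension 1 carries the concept.
pos = [[10.0, 0.2], [10.2, 0.3], [9.8, 0.25]]
neg = [[10.1, -0.2], [9.9, -0.3], [10.0, -0.25]]
queries = [[10.05, 0.22], [9.95, -0.28]]

predictions = classify_queries(pos, neg, queries)
print(predictions)
```

Because the statistics are recomputed for every problem and nothing is learned, a step like this adds no parameters, which matches the "parameter-free" description above.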

Paper Structure (50 sections, 1 equation, 30 figures, 9 tables)

Figures (30)

  • Figure 1: Sample problem from Bongard-HOI. The positives share a concept, and the negatives lack that concept. Considering all supports and one query at a time, is the query a positive or negative example of the key concept? Our model's outputs, and ground truth, are in the footnote.
  • Figure 2: Vision Backbone and SVM-Mimic. Given "support" images labelled positive and negative, our goal is to obtain a hyperplane which can accurately classify new query images. We use a vision backbone to obtain an embedding for each support image, and then feed these in parallel to a Transformer encoder and to an SVM. Both modules output a hyperplane, and we penalize the cosine distance between them. For natural images, we use the encoder from CLIP; for geometric drawings, we train an encoder from scratch.
  • Figure 3: Robustness to number of supports in Bongard-HOI. The x-axis measures the number of supports of each class seen, with 6 being the default setting, and the y-axis measures accuracy. Means and standard deviations are across three runs, where randomness is with respect to the subset of supports chosen.
  • Figure 4: Robustness to label noise in Bongard-HOI. The x-axis measures the number of supports of each class that have not been noised. Other details are the same as in Figure 3.
  • Figure 5: Bongard-HOI unseen act/unseen obj: Correct guess for both queries. The concept is "hold and about to eat apple."
  • ...and 25 more figures
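
The Figure 2 caption says that the Transformer encoder and the SVM each output a hyperplane and that the cosine distance between them is penalized. A minimal sketch of that penalty follows, assuming each hyperplane is represented by its normal vector only; whether a bias term enters the distance is not stated here. The function names and numeric values are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity: 0 for parallel vectors, 2 for opposite ones."""
    return 1.0 - dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + eps)

def hyperplane_predict(w, b, x):
    """Classify an embedding x as positive if it lies on the w-side of the plane."""
    return dot(w, x) + b > 0

# Hypothetical hyperplane normals: one fitted by an SVM on the support
# embeddings, one predicted by the Transformer encoder.
svm_plane = [0.0, 2.0]
predicted_plane = [0.1, 1.0]

# The training penalty would be small when the two directions agree.
loss = cosine_distance(predicted_plane, svm_plane)
print(round(loss, 4))
```

Cosine distance ignores the magnitude of the normal vectors, so a penalty of this form only asks the predicted hyperplane to point the same way as the SVM's.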