KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models

Eunice Yiu, Maan Qraitem, Anisa Noor Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, Kate Saenko

TL;DR

KiVA introduces a kid-inspired benchmark to probe visual analogical reasoning in large multimodal models using real-world objects. It employs a three-stage paradigm—what changed, how it changed, and extrapolation—to compare LMMs against children and adults across five transformation domains. Findings show that while models can detect basic changes, they struggle to specify and extrapolate rules, with children and adults significantly outperforming them, especially on spatial and numerical tasks. The work highlights the limitations of 2D training data for complex visual reasoning and points to future directions such as symbolic visual representations and Bayesian inference to improve generalization.

Abstract

This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children. A "visual analogy" is an abstract rule inferred from one image and applied to another. While benchmarks exist for testing visual reasoning in LMMs, they require advanced skills and omit basic visual analogies that even young children can make. Inspired by developmental psychology, we propose a new benchmark of 4,300 visual transformations of everyday objects to test LMMs on visual analogical reasoning and compare them to children (ages three to five) and to adults. We structure the evaluation into three stages: identifying what changed (e.g., color, number, etc.), how it changed (e.g., added one object), and applying the rule to new scenarios. Our findings show that while GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS identify the "what" effectively, they struggle with quantifying the "how" and extrapolating this rule to new objects. In contrast, children and adults exhibit much stronger analogical reasoning at all three stages. Additionally, the strongest tested model, GPT-o1, performs better in tasks involving simple surface-level visual attributes like color and size, correlating with quicker human adult response times. Conversely, more complex tasks such as number, rotation, and reflection, which necessitate extensive cognitive processing and understanding of extrinsic spatial properties in the physical world, present more significant challenges. Altogether, these findings highlight the limitations of training models on data that primarily consists of 2D images and text.
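
The three-stage evaluation can be sketched in code. The following is a minimal illustration, not the authors' implementation: it follows the trial flow described in the Figure 3 caption (specification is asked only after a correct classification; extrapolation is always asked) and records the chance levels stated in the Figure 4 caption. The names `Trial`, `TrialResult`, `run_trial`, `accuracy`, and the `respond` callback are hypothetical, and scoring specification only over trials where it was asked is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Chance levels as stated in the Figure 4 caption.
CHANCE = {"classification": 0.25, "specification": 1 / 3, "extrapolation": 1 / 3}


@dataclass
class Trial:
    # Hypothetical container for one trial's correct answers.
    domain: str
    classification_answer: str
    specification_answer: str
    extrapolation_answer: str


@dataclass
class TrialResult:
    classification: bool
    specification: Optional[bool]  # None when the stage was skipped
    extrapolation: bool


def run_trial(trial: Trial, respond: Callable[[str, Trial], str]) -> TrialResult:
    """Run one trial; `respond(stage, trial)` returns the participant's answer."""
    classified = respond("classification", trial) == trial.classification_answer
    specified = None
    if classified:
        # Specification is asked only after a correct classification.
        specified = respond("specification", trial) == trial.specification_answer
    # Extrapolation is asked whether or not classification was correct.
    extrapolated = respond("extrapolation", trial) == trial.extrapolation_answer
    return TrialResult(classified, specified, extrapolated)


def accuracy(results: List[TrialResult], stage: str) -> float:
    """Mean correctness for a stage, ignoring trials where it was skipped."""
    scored = [getattr(r, stage) for r in results if getattr(r, stage) is not None]
    return sum(scored) / len(scored) if scored else float("nan")
```

A participant's per-stage accuracy can then be compared against `CHANCE[stage]`; any `respond` callable works, whether it wraps a model query or replays recorded human answers.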

Paper Structure

This paper contains 21 sections, 14 figures.

Figures (14)

  • Figure 1: KiVA: Kid-inspired Visual Analogies. (a) 5 visual analogy domains examined in KiVA and KiVA-adults (see Figure 3 for the full task format). Unlike in KiVA, the starting color, size, orientation and number of test objects in KiVA-adults further differ from the starting values of the given transformations. (b) Performance of children, adults, and LMMs in extrapolating a transformation rule to a novel object in KiVA (top) and KiVA-adults (bottom).
  • Figure 2: Prior benchmarks versus KiVA for visual analogies. (a) Prior benchmarks like I. ConceptARC, II. Raven's Progressive Matrices, and III. CCSE Reasoning involve arbitrary changes of abstract shapes and grids. (b) KiVA examines basic changes that even three-year-olds can solve.
  • Figure 3: An example of a trial in KiVA. Models and humans are first asked to classify a given transformation (left). If the classification is correct (green arrow), they are further evaluated on their verbal specification of the transformation (middle) and then on visual extrapolation (right). Otherwise, they skip directly to visual extrapolation (yellow arrow).
  • Figure 4: Human and model performance in KiVA sorted by Transformation Domain and color coded by Question Type in samples annotated by children (top figure) and in the full benchmark annotated by adults (bottom figure). Error bars represent standard errors across object variations. Chance level is $25\%$ for Verbal Classification; $33\%$ for Verbal Specification and Visual Extrapolation.
  • Figure 5: Adult and model performance in KiVA-adults sorted by Transformation Domain and color-coded by Question Type. Error bars and chance levels are as described in Figure 4.
  • ...and 9 more figures