What Makes a Maze Look Like a Maze?

Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu

TL;DR

Deep Schema Grounding (DSG) tackles visual reasoning over abstract concepts by introducing explicit schemas that decompose a high-level concept into subcomponents and the dependencies among them. The framework extracts universal schemas with large language models (LLMs), hierarchically grounds them on images with vision-language models (VLMs), and uses the grounded schema to augment visual question answering (VQA). The Visual Abstractions Benchmark (VAB) provides a diverse, real-world testbed of 540 questions across 12 abstract concepts in four categories for evaluating grounding and reasoning. Empirically, DSG improves base methods across question types and across models, with notable gains on counting questions. The evaluation also reveals limitations in spatial grounding and schema bias that guide future work.

Abstract

A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then hierarchically grounds concrete to abstract components of the schema onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Dataset, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.
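The pipeline described in the abstract (a schema as a dependency graph, grounded from concrete to abstract components, then used to augment a question) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `MAZE_SCHEMA` components, the `ground_schema` and `build_prompt` helpers, and the `fake_vlm` stub are hypothetical and do not reproduce the authors' schemas, prompts, or models. The sketch only shows the ordering idea: a component is grounded after the components it depends on, and their answers are passed along as context.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List

# Hypothetical schema for "maze": each component maps to the components it
# depends on. The real schemas are extracted by an LLM; this one is made up.
MAZE_SCHEMA: Dict[str, List[str]] = {
    "layout": [],
    "walls": ["layout"],
    "entry-exit": ["layout", "walls"],
}

VLMCall = Callable[[str, Dict[str, str]], str]


def ground_schema(schema: Dict[str, List[str]], ask_vlm: VLMCall) -> Dict[str, str]:
    """Ground each component after its dependencies, passing their answers as context."""
    grounded: Dict[str, str] = {}
    # static_order() yields every node after all of its predecessors.
    for component in TopologicalSorter(schema).static_order():
        context = {dep: grounded[dep] for dep in schema[component]}
        grounded[component] = ask_vlm(component, context)
    return grounded


def fake_vlm(component: str, context: Dict[str, str]) -> str:
    """Stand-in for a vision-language model: looks at no image and returns canned
    answers. A real call would send the image and `context` in its prompt."""
    canned = {
        "layout": "paths laid out on a forest floor",
        "walls": "tree branches",
        "entry-exit": "gaps at two opposite corners",
    }
    return canned[component]


def build_prompt(question: str, grounded: Dict[str, str]) -> str:
    """Prepend the grounded schema to a question (assumed prompt format)."""
    lines = [f"{name}: {value}" for name, value in grounded.items()]
    return "Grounded schema:\n" + "\n".join(lines) + f"\nQuestion: {question}"


if __name__ == "__main__":
    grounded = ground_schema(MAZE_SCHEMA, fake_vlm)
    print(build_prompt("How many exits does the maze have?", grounded))
```

Swapping `fake_vlm` for a function that queries an actual model would leave the ordering logic unchanged, which is one way to read the claim that the approach is model-agnostic.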

Paper Structure

This paper contains 37 sections, 9 figures, and 13 tables.

Figures (9)

  • Figure 1: There exist abstract concepts, such as "maze", which are defined by lifted symbols and patterns, instead of concrete visual features. We propose Deep Schema Grounding (DSG), a framework for visual reasoning over such abstract concepts, which uses schemas to structure models' interpretation of images. DSG hierarchically grounds conceptual schemas on images and uses them to provide holistic context to VLMs, improving performance across diverse downstream queries.
  • Figure 2: Deep Schema Grounding consists of three main stages: (1) extracting schemas of abstract concepts with large language models, (2) hierarchically grounding schemas on images with vision-language models, and (3) conducting visual question-answering augmented with resolved schemas.
  • Figure 3: Examples of schemas for concepts and the visual features that they may be grounded to.
  • Figure 4: The schema grounding process.
  • Figure 5: The Visual Abstractions Benchmark comprises diverse, real-world images that represent 12 different abstract concepts across 4 categories.
  • ...and 4 more figures