Visual Agentic AI for Spatial Reasoning with a Dynamic API

Damiano Marsili, Rohun Agrawal, Yisong Yue, Georgia Gkioxari

TL;DR

This work tackles 3D spatial reasoning by introducing VADAR, a training-free agentic framework that dynamically constructs a Python-based API to decompose complex visual-grounding tasks. The method has two stages: API Generation, where LLM agents create and implement reusable functions, and Program Synthesis, where a Program Agent writes executable Python code that a separate Execution Agent runs with vision specialists. Empirical results show that a dynamic API substantially outperforms static DSL baselines and remains competitive with monolithic vision-language models on Omni3D-Bench; oracle-vision analyses identify the vision components as the principal bottleneck. The work also introduces Omni3D-Bench, a benchmark designed to stress 3D grounding and inference. Together, these results highlight the potential of training-free, interpretable, agentic reasoning for scalable 3D spatial understanding and suggest directions for improving specialized vision models and integration strategies.

Abstract

Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries. To assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero-shot models for visual reasoning in 3D and empirically validate the effectiveness of our agentic framework for 3D spatial reasoning tasks. Project website: https://glab-caltech.github.io/vadar/
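
The two-stage flow described above (API generation, then program synthesis and execution) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the LLM agents are replaced by hard-coded source strings, the vision specialists by stubs over a toy scene, and every name here (`locate`, `depth`, `closest`, `is_behind`, `build_api`, `run_program`) is hypothetical.

```python
from typing import Dict, List

# Hypothetical stand-ins for vision specialists. A real system would call
# an object detector and a depth estimator; here they read a toy scene.
def locate(scene: List[dict], name: str) -> List[dict]:
    return [obj for obj in scene if obj["name"] == name]

def depth(obj: dict) -> float:
    return obj["z"]

# Stage 1 (API generation): in the paper, LLM agents write reusable
# functions. Here the "generated" source is hard-coded (assumption).
GENERATED_API_SOURCE = '''
def closest(scene, name):
    """Return the instance of `name` nearest to the camera."""
    return min(locate(scene, name), key=depth)

def is_behind(a, b):
    """True if object a is farther from the camera than object b."""
    return depth(a) > depth(b)
'''

# Stage 2 (program synthesis): a Program Agent would write one such
# program per query, using the generated API. Hard-coded here as well.
GENERATED_PROGRAM = '''
chair = closest(scene, "chair")
table = closest(scene, "table")
answer = "yes" if is_behind(chair, table) else "no"
'''

def build_api(source: str) -> Dict[str, object]:
    # Execute the generated source in a namespace that already exposes
    # the vision-specialist stubs, so new functions can build on them.
    namespace: Dict[str, object] = {"locate": locate, "depth": depth}
    exec(source, namespace)
    return namespace

def run_program(program: str, api: Dict[str, object], scene: List[dict]) -> str:
    # Run the query-specific program against the API and the scene.
    env = dict(api)
    env["scene"] = scene
    exec(program, env)
    return env["answer"]

scene = [
    {"name": "chair", "z": 3.2},
    {"name": "chair", "z": 4.5},
    {"name": "table", "z": 2.0},
]
api = build_api(GENERATED_API_SOURCE)
answer = run_program(GENERATED_PROGRAM, api, scene)
print(answer)
```

The point of the sketch is that the API is data produced at run time rather than a fixed, human-written set of functions, so the query program stays short because it reuses `closest` and `is_behind` instead of re-deriving them.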

Paper Structure

This paper contains 17 sections, 16 figures, 7 tables, 2 algorithms.

Figures (16)

  • Figure 1: Spatial reasoning in 3D is challenging as it requires multiple steps of grounding and inference. We introduce a benchmark for 3D understanding with complex queries; an example is shown here. To tackle these queries we propose a training-free agentic approach, VADAR, that dynamically generates new skills in Python and thus can handle a wider range of queries compared to prior methods.
  • Figure 2: Overview. VADAR consists of an API generation stage and a program synthesis stage. The Signature & Implementation Agents generate an API that is used by the Program Agent to produce a program to answer the question, executed by the Execution Agent.
  • Figure 3: LEFT vs. VADAR on CLEVR. LEFT requires supervision. We vary the amount of training data (x-axis) and report accuracy (y-axis). VADAR requires no supervision but takes in 15 queries without answers to guide the creation of the API. VADAR outperforms LEFT trained with $\leq 10{,}000$ supervised examples.
  • Figure 4: Program outputs for VisProg, ViperGPT and VADAR. For each example, we show the query, the input image, and the method's program generations. Queries are from our benchmark and pertain to 3D understanding of scenes. Zoom in to read the programs.
  • Figure 5: (a) The No-API agent produces longer programs and is prone to errors, often mistakenly using depth for left/right comparisons. (b) In contrast, our agentic VADAR creates shorter programs by leveraging methods from the API.
  • ...and 11 more figures
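
The error noted in the Figure 5(a) caption, using depth for left/right comparisons, can be illustrated with a small sketch. It assumes image coordinates in which x grows to the right and depth is the estimated distance from the camera; the `Detection` type and all function names are hypothetical and do not come from the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    name: str
    x_center: float  # horizontal pixel coordinate of the box center
    depth: float     # estimated distance from the camera

def is_left_of(a: Detection, b: Detection) -> bool:
    # Left/right is decided by horizontal image position.
    return a.x_center < b.x_center

def is_in_front_of(a: Detection, b: Detection) -> bool:
    # Front/behind is decided by depth.
    return a.depth < b.depth

def is_left_of_buggy(a: Detection, b: Detection) -> bool:
    # The mistake described in the caption: depth used for left/right.
    return a.depth < b.depth

lamp = Detection("lamp", x_center=520.0, depth=1.5)
sofa = Detection("sofa", x_center=140.0, depth=3.0)

print(is_left_of(lamp, sofa))        # lamp is to the right of the sofa
print(is_left_of_buggy(lamp, sofa))  # buggy version disagrees
```

Keeping such predicates as named, reusable API functions means the axis choice is made once, rather than being re-decided (and possibly confused) in every generated program.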