Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Aleksandar Stanić, Sergi Caelles, Michael Tschannen

TL;DR

The paper tackles compositional visual reasoning by shifting from monolithic end-to-end models to LLMs that write programs calling visual tools. It introduces an Abstract API with spatially and temporally abstract routines, automatically generated in-context examples (ACEs) derived from a small labeled set, and self-correction mechanisms (self-debugging and self-tuning), all aimed at reducing human engineering of prompts. Across four datasets (RefCOCO, RefCOCO+, GQA, and NExT-QA), the Abstract API and ACEs yield consistent gains, self-tuning delivers additional improvements, and self-debugging gives mixed results. The work suggests that combining structured tool use with automated example generation and adaptive correction improves robustness and generalization in visual reasoning, moving closer to truly zero-shot compositional reasoning.

Abstract

Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools. Recently, these models achieved great performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models rely heavily on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLMs-as-controllers setup more robust, and removes the need for human engineering of in-context examples.
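
One way to read "leveraging a small number of labeled examples to automatically generate in-context examples" is sketched below. This is an illustration under stated assumptions, not the authors' implementation: the names `collect_aces`, `format_prompt`, `generate_program`, `run_program`, and `is_correct` are hypothetical, and the rule of keeping only programs whose output matches the label is an assumption about how the labeled set is used.

```python
from typing import Any, Callable, List, Optional, Tuple

# (query, input, label) triples: a small labeled set, as mentioned in the abstract.
LabeledExample = Tuple[str, Any, Any]


def collect_aces(
    labeled: List[LabeledExample],
    generate_program: Callable[[str], str],  # hypothetical: query -> program source
    run_program: Callable[[str, Any], Any],  # hypothetical: (source, input) -> answer
    is_correct: Callable[[Any, Any], bool],  # hypothetical: (answer, label) -> bool
    max_aces: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Keep (query, program) pairs whose execution result matches the label."""
    aces: List[Tuple[str, str]] = []
    for query, inp, label in labeled:
        # Assumed zero-shot call: no in-context examples are in the prompt yet.
        source = generate_program(query)
        try:
            answer = run_program(source, inp)
        except Exception:
            continue  # programs that crash are never kept as examples
        if is_correct(answer, label):
            aces.append((query, source))
        if max_aces is not None and len(aces) >= max_aces:
            break
    return aces


def format_prompt(api_doc: str, aces: List[Tuple[str, str]], query: str) -> str:
    """Assemble a prompt: API description, then the kept examples, then the new query."""
    shots = "\n\n".join(f"# Query: {q}\n{src}" for q, src in aces)
    return f"{api_doc}\n\n{shots}\n\n# Query: {query}\n"
```

Passing the generator, executor, and correctness check as callables keeps the sketch independent of any particular LLM or tool set; in practice `is_correct` would depend on the task (for instance, an IoU threshold for grounding or string match for question answering).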

Paper Structure

This paper contains 35 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: (a) Example image from RefCOCO (Yu et al., 2016). (b) A code-generating LLM takes as input the query, the Python API (functions for "tool use" plus the Abstract API routines introduced in the Abstract API section), and a number of in-context examples (ICEs); human-engineered ICEs are replaced by automatically generated ACEs, as described in the ACE method section. The LLM generates code that takes the image as input and outputs an answer (here, a bounding box). If the code fails to run, "self-tuning" (see the self-correction section) can adjust parameters and generate new code.
  • Figure 2: Using our Abstract API improves performance over the ViperGPT API across all datasets. Similarly, ACEs consistently improve performance, and these gains compound with the gains from the Abstract API. Uncertainty bars represent standard deviations computed over three random seeds.
  • Figure 3: Increasing the number of ACEs in the prompt improves performance. Note that using the ViperGPT API on NExT-QA results in only three correct ACEs, so the performance plateaus after four ACEs.
  • Figure 4: Increasing the number of "self-tuning" steps leads to improved performance. Our Abstract API (Abs. API) consistently outperforms the ViperGPT API (Vip. API). The best performance is achieved when using a dynamic object detector threshold (Dyn.t) in addition to the Abstract API with ACEs.
  • Figure 5: Error diagrams for the ViperGPT API and our Abstract API. We visualize the percentages of samples with IoU in certain ranges. "Err" classes are samples for which code execution failed due to one of: object detection (Obj.Det), wrong return type (Ret.Type), or some other error (Other), e.g., hallucination.
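
The outcome categories in Figure 5 can be illustrated with a small sketch that assigns one sample either to an IoU range or to an "Err" class. This is a hedged illustration, not the paper's evaluation code: the exception type `ObjectNotFoundError`, the helper names, the `(x_min, y_min, x_max, y_max)` box layout, and the bin edges `(0.3, 0.7)` are all assumptions, since the caption does not state the actual ranges.

```python
from typing import Any, Callable, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


class ObjectNotFoundError(Exception):
    """Hypothetical error for the case where the detector returns no match."""


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_box(value: Any) -> bool:
    """True if value looks like four numbers, i.e. a plausible bounding box."""
    return (
        isinstance(value, (tuple, list))
        and len(value) == 4
        and all(isinstance(v, (int, float)) for v in value)
    )


def classify_outcome(
    program: Callable[[], Any],            # generated program, already bound to its image
    target: Box,                           # ground-truth box
    edges: Sequence[float] = (0.3, 0.7),   # illustrative bin edges, not from the paper
) -> str:
    """Return an 'Err: ...' class or the IoU range the prediction falls into."""
    try:
        pred = program()
    except ObjectNotFoundError:
        return "Err: Obj.Det"
    except Exception:
        return "Err: Other"
    if not is_box(pred):
        return "Err: Ret.Type"
    score = iou(tuple(pred), target)
    low = 0.0
    for high in edges:
        if score < high:
            return f"IoU in [{low}, {high})"
        low = high
    return f"IoU in [{low}, 1.0]"
```

Counting the returned labels over a dataset and normalizing by the number of samples would give percentages of the kind the figure visualizes.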