A Neurosymbolic Agent System for Compositional Visual Reasoning

Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu

TL;DR

VLAgent addresses the challenge of compositional visual reasoning by introducing a two-stage neuro-symbolic agent that first plans with a structured symbolic program and then executes it through a backend that maps the plan to executable code. Key innovations include the SS-Parser for syntax and semantic repair of LLM-generated plans and an output verifier framework comprising caption-based checks and ensemble model verification, plus a long-video optimization strategy. Empirical results on six visual benchmarks show VLAgent achieving strong zero-shot performance and competitive results relative to supervised baselines, with ablations confirming the value of each verification component. The approach delivers robust generalization, interpretability, and practical efficiency for complex visual-question-answering and video reasoning tasks, highlighting the benefits of integrating symbolic planning with neural modules in vision-language systems.

Abstract

Advances in large language models (LLMs) and large vision models have fueled rapid progress in multi-modal vision-language reasoning. However, existing vision-language models (VLMs) still struggle with compositional visual reasoning. This paper presents VLAgent, a neuro-symbolic Vision-Language Agent system for efficient compositional visual reasoning, with three novel features. First, VLAgent is an interpretable, visualization-enhanced, two-stage neuro-symbolic reasoning system. In the first stage, a front-end engine generates a structured visual reasoning plan (a symbolic program script) for each compositional visual reasoning task, using a pre-trained LLM prompted with few-shot chain-of-thought in-context learning. In the second stage, a high-performance back-end engine transforms the planning script into executable code based on the visual input (image or video) and a combination of neural models and symbolic functions, and then performs the resulting sequence of actions for the compositional visual reasoning task. Second, to improve the quality of the mapping from the logic plan to a sequence of executable instructions, VLAgent introduces the SS-Parser, which checks the syntactic and semantic correctness of the planning script and detects and repairs logic errors in the LLM-generated plan before the executable program is generated. Third, VLAgent introduces an execution verifier at critical reasoning steps to validate and refine its compositional reasoning results in a stepwise manner, for example using ensemble methods for critical visual reasoning and caption analysis for low-confidence compositional reasoning. Extensive experiments on six visual benchmarks, compared against a dozen SoTA visual reasoning models, show that VLAgent outperforms existing representative approaches to compositional visual reasoning.
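
To make the plan-check-execute flow described in the abstract concrete, here is a minimal Python sketch; it is not the authors' implementation. Everything in it is an assumption made for illustration: the line format `OUT=MODULE(arg1|arg2)`, the `$name` convention for variable references, the `EQ` function, and the stubbed `VQA` module (only the name `VQA` comes from the paper's Figure 3 caption; its signature and behavior here are invented). The checker only detects errors; how VLAgent repairs them is not specified here, so no repair step is shown.

```python
import re
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, List[str]]            # (output variable, module name, argument tokens)
ModuleTable = Dict[str, Tuple[int, Callable]]  # module name -> (arity, implementation)

# Hypothetical plan-line format: OUT=MODULE(arg1|arg2); '|' separates arguments.
LINE_RE = re.compile(r"^(\w+)=(\w+)\((.*)\)$")


def parse_plan(script: str) -> List[Step]:
    """Syntax check: every non-empty line must match OUT=MODULE(args)."""
    steps: List[Step] = []
    for lineno, line in enumerate(script.strip().splitlines(), start=1):
        match = LINE_RE.match(line.strip())
        if match is None:
            raise SyntaxError(f"line {lineno}: cannot parse {line!r}")
        out, module, raw_args = match.groups()
        args = [a.strip() for a in raw_args.split("|")] if raw_args.strip() else []
        steps.append((out, module, args))
    return steps


def check_plan(steps: List[Step], modules: ModuleTable, inputs: Dict[str, object]) -> List[str]:
    """Semantic check: known modules, correct arity, variables defined before use."""
    errors: List[str] = []
    defined = set(inputs)
    for out, module, args in steps:
        if module not in modules:
            errors.append(f"{out}: unknown module {module}")
        elif len(args) != modules[module][0]:
            errors.append(f"{out}: {module} expects {modules[module][0]} arguments")
        for arg in args:
            # Tokens starting with '$' are variable references; the rest are literals.
            if arg.startswith("$") and arg[1:] not in defined:
                errors.append(f"{out}: undefined variable {arg[1:]}")
        defined.add(out)
    return errors


def run_plan(steps: List[Step], modules: ModuleTable, inputs: Dict[str, object]) -> Dict[str, object]:
    """Execute a checked plan step by step, storing each result under its output name."""
    env = dict(inputs)
    for out, module, args in steps:
        _, fn = modules[module]
        values = [env[a[1:]] if a.startswith("$") else a for a in args]
        env[out] = fn(*values)
    return env


def fake_vqa(image: dict, question: str) -> str:
    # Stand-in for a neural VQA model: returns a canned answer stored with the image.
    return image["answers"][question]


def equal(a: object, b: object) -> bool:
    # Hypothetical symbolic function: compares two intermediate answers.
    return str(a) == str(b)


MODULES: ModuleTable = {"VQA": (2, fake_vqa), "EQ": (2, equal)}
INPUTS = {
    "LEFT": {"answers": {"How many dogs?": "2"}},
    "RIGHT": {"answers": {"How many dogs?": "2"}},
}
GOOD_PLAN = """
A0=VQA($LEFT|How many dogs?)
A1=VQA($RIGHT|How many dogs?)
FINAL=EQ($A0|$A1)
"""
BAD_PLAN = """
A0=VQA($LEFT|How many dogs?)
FINAL=EQ($A0|$A2)
"""

if __name__ == "__main__":
    print(check_plan(parse_plan(BAD_PLAN), MODULES, INPUTS))
    print(run_plan(parse_plan(GOOD_PLAN), MODULES, INPUTS)["FINAL"])
```

The point of checking before executing is that a plan referring to an undefined intermediate result (as `BAD_PLAN` does with `$A2`) is rejected before any stand-in for an expensive neural module is called.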

Paper Structure

This paper contains 21 sections, 3 equations, 10 figures, 7 tables, 2 algorithms.

Figures (10)

  • Figure 1: Examples illustrating performance of eight VLMs on NeXT-QA compared with VLAgent.
  • Figure 2: A two-stage neurosymbolic architecture of the VLAgent system, with the front-end engine and back-end engine working in concert.
  • Figure 3: Core Modules in VLAgent $\alpha$ release. NLVR2 uses VQA, EVAL and RESULT. VideoQA uses modules taking video input plus SELECT and EVAL. HC-RefLOCO (referring expression) uses LOC, CAP, FIND, VOTE.
  • Figure 4: VLAgent SS-Parser corrects the reasoning error detected in the LLM-generated planning script.
  • Figure 5: Visual comparison on six image QA examples.
  • ...and 5 more figures