A Neurosymbolic Agent System for Compositional Visual Reasoning

Yichang Xu; Gaowen Liu; Ramana Rao Kompella; Sihao Hu; Fatih Ilhan; Selim Furkan Tekin; Zachary Yahn; Ling Liu

TL;DR

VLAgent addresses compositional visual reasoning with a two-stage neuro-symbolic agent: it first plans with a structured symbolic program and then executes that plan through a back-end that maps it to executable code. Key components are the SS-Parser, which repairs syntax and semantic errors in LLM-generated plans; an output verifier that combines caption-based checks with ensemble model verification; and an optimization strategy for long videos. Results on six visual benchmarks show strong zero-shot performance that is competitive with supervised baselines, and ablations confirm the value of each verification component. The authors report robust generalization, interpretability, and practical efficiency on complex visual question answering and video reasoning tasks, which highlights the benefit of integrating symbolic planning with neural modules in vision-language systems.

Abstract

Advances in large language models (LLMs) and large vision models have fueled rapid progress in multi-modal vision-language reasoning. However, existing vision-language models (VLMs) remain challenged by compositional visual reasoning. This paper presents VLAgent, a neuro-symbolic approach to building a Vision-Language Agent system for efficient compositional visual reasoning, with three novel features. First, VLAgent develops an interpretable, visualization-enhanced two-stage neuro-symbolic reasoning system. The first stage is managed by a front-end engine that generates a structured visual reasoning plan (a symbolic program script) for each compositional visual reasoning task, using a pre-trained LLM with few-shot chain-of-thought in-context learning. The second stage is managed by a high-performance back-end engine, which transforms the planning script into executable code based on the visual input (image or video) and a combination of neural models and symbolic functions, and then performs a sequence of actions for the compositional visual reasoning task. Second, to ensure and improve the quality of mapping the logic plan to a sequence of executable instructions, VLAgent introduces the SS-Parser, which examines the syntactic and semantic correctness of the planning script and detects and repairs logic errors in the LLM-generated plan before the executable program is generated. Third, VLAgent introduces an execution verifier at critical reasoning steps to validate and refine its compositional reasoning results in a stepwise manner, for example with ensemble methods for critical visual reasoning and caption analysis for low-confidence compositional reasoning. Extensive experiments on six visual benchmarks against a dozen state-of-the-art (SoTA) visual reasoning models show that VLAgent outperforms existing representative approaches to compositional visual reasoning.
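
To make the idea of checking a planning script before execution more concrete, the following is a minimal sketch of a syntax and semantic check with simple repairs. It is not the SS-Parser described in the paper: the script format, the module names (`LOC`, `CROP`, `VQA`, `RESULT`), their argument names, and both repair rules are assumptions chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical module table (name -> expected argument names); not from the paper.
MODULES = {
    "LOC": ("image", "object"),
    "CROP": ("image", "box"),
    "VQA": ("image", "question"),
    "RESULT": ("var",),
}

# Assumed step syntax: OUT = MODULE(key=value, key=value)
STEP_RE = re.compile(r"^(\w+)\s*=\s*(\w+)\((.*)\)$")


@dataclass
class Step:
    out: str
    module: str
    args: dict


def parse_args(text):
    """Split 'k=v, k=v' into a dict (toy parser: no commas inside literals)."""
    args = {}
    for part in filter(None, (p.strip() for p in text.split(","))):
        key, _, value = part.partition("=")
        args[key.strip()] = value.strip()
    return args


def check_and_repair(script):
    """Return (steps, notes); raise ValueError when a step cannot be repaired."""
    defined = {"IMAGE"}  # the input image is assumed to be predefined
    steps, notes = [], []
    lines = filter(None, map(str.strip, script.splitlines()))
    for n, line in enumerate(lines, 1):
        match = STEP_RE.match(line)
        if match is None:
            raise ValueError(f"step {n}: syntax error in {line!r}")
        out, module, arg_text = match.groups()

        # Illustrative syntax repair: fix the letter case of a known module name.
        if module not in MODULES and module.upper() in MODULES:
            notes.append(f"step {n}: renamed module {module} to {module.upper()}")
            module = module.upper()
        if module not in MODULES:
            raise ValueError(f"step {n}: unknown module {module}")

        args = parse_args(arg_text)
        if set(args) != set(MODULES[module]):
            raise ValueError(f"step {n}: {module} expects {MODULES[module]}")

        # Semantic check: every non-literal argument must already be defined.
        for key, value in list(args.items()):
            is_literal = value.startswith("'") and value.endswith("'")
            if is_literal or value in defined:
                continue
            if key == "image":
                # Illustrative semantic repair: fall back to the input image.
                notes.append(f"step {n}: undefined {value} replaced by IMAGE")
                args[key] = "IMAGE"
            else:
                raise ValueError(f"step {n}: undefined variable {value}")

        defined.add(out)
        steps.append(Step(out, module, args))
    return steps, notes


SCRIPT = """
BOX0 = Loc(image=IMAGE, object='dog')
IMG0 = CROP(image=IMAGE, box=BOX0)
ANS0 = VQA(image=IMG1, question='What color is the dog?')
FINAL = RESULT(var=ANS0)
"""

steps, notes = check_and_repair(SCRIPT)
for note in notes:
    print(note)
```

In this toy script, the miscased module name in the first step and the undefined variable `IMG1` in the third step are both repaired and recorded in `notes`, while a line that does not match the assumed step syntax raises an error instead of reaching execution.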
