ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning

Yuxuan Wang, Alan Yuille, Zhuowan Li, Zilong Zheng

TL;DR

ExoViP tackles two core weaknesses of vision-language programming for compositional visual reasoning: planning errors made by the LLM and execution errors made by the vision modules. It adds "exoskeleton" verification modules (image-text matching, image captioning, and VQA) that validate the prediction after each reasoning step and calibrate or correct the output, while a tree-based search guided by the verification scores and LLM self-correction refines the reasoning trace. Applied to VisProg and ViperGPT across six tasks, the method consistently improves accuracy and planning efficiency, reduces unexecutable plans, and demonstrates robustness via negative sampling and PSC. Because the verification framework is plug-and-play, it offers a practical path to more reliable multi-step VL programs and better generalization in open-domain multimodal reasoning.

Abstract

Compositional visual reasoning methods, which translate a complex query into a structured composition of feasible visual tasks, have shown strong potential in complicated multi-modal tasks. Empowered by recent advances in large language models (LLMs), this multi-modal challenge has been brought to a new stage by treating LLMs as few-shot/zero-shot planners, i.e., vision-language (VL) programming. Such methods, despite their numerous merits, suffer from LLM planning mistakes and inaccuracies in the visual execution modules, and they lag behind non-compositional models. In this work, we devise a "plug-and-play" method, ExoViP, to correct errors in both the planning and execution stages through introspective verification. We employ verification modules as "exoskeletons" to enhance current VL programming schemes. Specifically, our proposed verification module utilizes a mixture of three sub-verifiers to validate predictions after each reasoning step, subsequently calibrating the visual module predictions and refining the reasoning trace planned by LLMs. Experimental results on two representative VL programming methods showcase consistent improvements on five compositional reasoning tasks on standard benchmarks. In light of this, we believe that ExoViP can foster better performance and generalization on open-domain multi-modal challenges.
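
The calibration idea in the abstract (a mixture of three sub-verifiers scoring each step's prediction and adjusting the vision module's output) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the names `mix_verifiers` and `calibrate`, the plain averaging of the three scores, and the multiply-then-renormalize update are hypothetical, and the paper's actual mixing and calibration rules may differ.

```python
from typing import Callable, Dict, List

# Hypothetical verifier signature: candidate answer -> score in [0, 1].
# In practice each verifier would also see the image and the query.
Verifier = Callable[[str], float]


def mix_verifiers(candidate: str, verifiers: List[Verifier]) -> float:
    """Combine sub-verifier scores by simple averaging (an assumed rule)."""
    scores = [verify(candidate) for verify in verifiers]
    return sum(scores) / len(scores)


def calibrate(module_probs: Dict[str, float],
              verifiers: List[Verifier]) -> Dict[str, float]:
    """Reweight each candidate's module probability by its mixed
    verification score, then renormalize so the result sums to 1."""
    weighted = {cand: prob * mix_verifiers(cand, verifiers)
                for cand, prob in module_probs.items()}
    total = sum(weighted.values())
    if total == 0:
        # Every candidate was rejected; fall back to the module's own output.
        return dict(module_probs)
    return {cand: weight / total for cand, weight in weighted.items()}


if __name__ == "__main__":
    # Stub verifiers standing in for image-text matching, captioning, and VQA.
    itm = lambda c: {"cat": 0.2, "dog": 0.9}[c]
    cap = lambda c: {"cat": 0.3, "dog": 0.8}[c]
    vqa = lambda c: {"cat": 0.1, "dog": 0.7}[c]
    print(calibrate({"cat": 0.6, "dog": 0.4}, [itm, cap, vqa]))
```

In this toy run the vision module initially prefers "cat", but the stub verifiers favor "dog", so the calibrated distribution ranks "dog" first.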

Paper Structure (49 sections, 1 equation, 22 figures, 9 tables, 1 algorithm)

This paper contains 49 sections, 1 equation, 22 figures, 9 tables, 1 algorithm.

Figures (22)

  • Figure 1: An overview of ExoViP. The prediction after each step is verified by the proposed "exoskeleton" verification modules, which contain a mix of three sub-verifiers. The verified scores help correct the errors in the vision module predictions or refine the reasoning programs planned by LLM.
  • Figure 2: Search of the reasoning trace. We beam search through the program tree, based on the verification scores as well as the LLM self-correctness.
  • Figure 3: Qualitative results of text-guided image editing on MagicBrush.
  • Figure 4: Distribution of the failure cases of original VisProg (left), and distribution of the failure cases of ExoViP (right).
  • Figure 5: Accuracy on GQA positively correlates with the verification scores.
  • ...and 17 more figures
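
The search described in the Figure 2 caption (beam search through a tree of candidate programs) can be sketched generically as follows. This is a plain beam search under stated assumptions, not the paper's algorithm: `propose` is a hypothetical stand-in for the LLM planner suggesting next steps, and `score` is a hypothetical stand-in for whatever combination of verification scores and LLM self-correctness the method uses, which this page does not specify.

```python
from typing import Callable, List, Sequence, Tuple

# A reasoning trace is modeled as a tuple of step strings (an assumption).
Trace = Tuple[str, ...]


def beam_search(propose: Callable[[Trace], Sequence[str]],
                score: Callable[[Trace], float],
                beam_width: int,
                max_steps: int) -> Trace:
    """Keep the `beam_width` best-scoring partial traces at each depth."""
    beams: List[Trace] = [()]
    for _ in range(max_steps):
        candidates: List[Trace] = []
        expanded = False
        for trace in beams:
            steps = propose(trace)
            if steps:
                expanded = True
                candidates.extend(trace + (step,) for step in steps)
            else:
                # Finished trace: carry it forward unchanged.
                candidates.append(trace)
        if not expanded:
            break  # no beam could be extended any further
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beams, key=score)


if __name__ == "__main__":
    # Toy planner: two options per step, programs are two steps long.
    toy_propose = lambda trace: ["a", "b"] if len(trace) < 2 else []
    # Toy score: prefer traces containing more "b" steps.
    toy_score = lambda trace: float(trace.count("b"))
    print(beam_search(toy_propose, toy_score, beam_width=2, max_steps=5))
```

A wider beam explores more alternative programs at the cost of more planner and verifier calls; a width of 1 reduces to greedy step-by-step selection.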