HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

Fucai Ke, Zhixi Cai, Simindokht Jahangard, Weiqing Wang, Pari Delir Haghighi, Hamid Rezatofighi

TL;DR

HYDRA addresses the limitations of monolithic and purely LLM-driven visual reasoning with a hyper-agent architecture that orchestrates planning, reasoning, and perception through an RL-controlled loop. The planner generates multiple instruction samples of varying depth, and a learned controller selects among them based on historical feedback. Reasoning proceeds incrementally: a State Memory Bank stores prior outputs and perception results, which supports explanation and correction via perception feedback. The approach achieves state-of-the-art results on several VR benchmarks (e.g., OK-VQA, GQA, RefCOCO/RefCOCO+), and ablation studies support the contribution of the RL controller, the sampling strategy, and incremental reasoning. Overall, HYDRA offers a generalizable framework that uses LLMs for planning and code generation while relying on cognitive control and visual feedback to improve reliability, efficiency, and cross-domain generalization in visual reasoning tasks.

Abstract

Recent advances in visual reasoning (VR), particularly with the aid of Large Vision-Language Models (VLMs), show promise but require access to large-scale datasets and face challenges such as high computational costs and limited generalization capabilities. Compositional visual reasoning approaches have emerged as effective strategies; however, they heavily rely on the commonsense knowledge encoded in Large Language Models (LLMs) to perform planning, reasoning, or both, without considering the effect of their decisions on the visual reasoning process, which can lead to errors or failed procedures. To address these challenges, we introduce HYDRA, a multi-stage dynamic compositional visual reasoning framework designed for reliable and incrementally progressive general reasoning. HYDRA integrates three essential modules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive controller, and a reasoner. The planner and reasoner modules utilize an LLM to generate instruction samples and executable code from the selected instruction, respectively, while the RL agent dynamically interacts with these modules, making high-level decisions on selection of the best instruction sample given information from the historical state stored through a feedback loop. This adaptable design enables HYDRA to adjust its actions based on previous feedback received during the reasoning process, leading to more reliable reasoning outputs and ultimately enhancing its overall effectiveness. Our framework demonstrates state-of-the-art performance in various VR tasks on four different widely-used datasets.

Paper Structure

This paper contains 16 sections, 5 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison of ViperGPT (Surís et al., 2023), IdealGPT (You et al., 2023), and HYDRA: ViperGPT employs a single feed-forward process, IdealGPT breaks questions down into sub-questions using a loop, while HYDRA uses diverse instructions and an RL agent in an incremental loop with feedback, showcasing its adaptability and efficiency in handling complex visual reasoning challenges.
  • Figure 2: The detailed HYDRA design includes the key modules: planner, controller, reasoner, textualizer, State Memory Bank ($s^{t-1}$), and meta information ($\eta$). The input $Q$ is given to the planner, which generates instructions $D^t$ using $s^{t-1}$ and $\eta$. The controller receives $D^t$ and, if the samples are invalid, requests alternative samples from the planner. Otherwise, it sends the chosen instruction $d^t_*$ to the reasoner, which generates perceptual output using Python APIs and VFMs. If the output is incomplete, the textualizer converts it to textual format, $f^t$, and stores it in the State Memory Bank. This process iterates until the desired final output, $\hat{Y}$, is achieved.
  • Figure 3: Detailed result examples from HYDRA. The first example shows the intermediate results of both iterations of the loop for question answering, whereas the second example shows the grounding task.
  • Figure 4: More result examples from HYDRA for question answering and visual grounding tasks.
  • Figure 5: Failure examples from HYDRA. The two samples on the left fail because of incorrectly generated code. The two on the right fail because of incorrect annotations.
  • ...and 4 more figures
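
The control flow described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`StateMemoryBank`, `hydra_style_loop`, the `toy_*` functions, `max_iters`, `max_replans`) is hypothetical, and the toy controller is a hand-written rule standing in for the trained RL agent. The planner and reasoner, which in HYDRA call an LLM and visual foundation models, are replaced by stubs so the loop runs on its own.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class StateMemoryBank:
    """Holds the textualized feedback f^t collected in earlier iterations."""
    feedback: List[str] = field(default_factory=list)


def hydra_style_loop(
    question: str,
    meta: str,
    planner: Callable[[str, List[str], str], List[str]],
    controller: Callable[[List[str], List[str]], Optional[str]],
    reasoner: Callable[[str, List[str]], Tuple[Any, bool]],
    textualizer: Callable[[Any], str],
    max_iters: int = 5,      # assumed iteration cap, not from the paper
    max_replans: int = 3,    # assumed cap on re-sampling requests
) -> Tuple[Optional[Any], StateMemoryBank]:
    memory = StateMemoryBank()
    for _ in range(max_iters):
        chosen = None
        for _ in range(max_replans):
            samples = planner(question, memory.feedback, meta)  # D^t
            chosen = controller(samples, memory.feedback)       # d^t_* or None
            if chosen is not None:
                break  # a valid instruction was selected
        if chosen is None:
            return None, memory  # no valid instruction after re-sampling
        output, is_final = reasoner(chosen, memory.feedback)
        if is_final:
            return output, memory  # final output \hat{Y}
        # Incomplete output: convert to text and store it for the next round.
        memory.feedback.append(textualizer(output))
    return None, memory


# Toy stand-ins (hypothetical) so the sketch can be executed end to end.
def toy_planner(question: str, feedback: List[str], meta: str) -> List[str]:
    return ["find the object"] if not feedback else ["answer from feedback"]


def toy_controller(samples: List[str], feedback: List[str]) -> Optional[str]:
    # Stand-in for the RL agent: accept the first sample, reject an empty set.
    return samples[0] if samples else None


def toy_reasoner(instruction: str, feedback: List[str]) -> Tuple[Any, bool]:
    if instruction == "find the object":
        return {"boxes": 1}, False  # perception result, not yet an answer
    return "yes", True


def toy_textualizer(output: Any) -> str:
    return f"perception: {output}"


answer, memory = hydra_style_loop(
    "Is there a cup on the table?", "toy meta information",
    toy_planner, toy_controller, toy_reasoner, toy_textualizer,
)
print(answer, memory.feedback)
```

In this toy run the first iteration produces only a perception result, which is textualized and stored, and the second iteration uses that stored feedback to return a final answer. The point of the sketch is the shape of the loop: the memory bank is the only channel through which one iteration influences the next.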