HAMMR: HierArchical MultiModal React agents for generic VQA

Lluis Castrejon, Thomas Mensink, Howard Zhou, Vittorio Ferrari, Andre Araujo, Jasper Uijlings

TL;DR

This work poses the VQA problem from a unified perspective, evaluating a single system on a varied suite of VQA tasks including counting, spatial reasoning, OCR-based reasoning, visual pointing, external knowledge, and more. It introduces HAMMR: HierArchical MultiModal React.

Abstract

Combining Large Language Models (LLMs) with external specialized tools (LLMs+tools) is a recent paradigm to solve multimodal tasks such as Visual Question Answering (VQA). While this approach was demonstrated to work well when optimized and evaluated for each individual benchmark, in practice it is crucial for the next generation of real-world AI systems to handle a broad range of multimodal problems. Therefore we pose the VQA problem from a unified perspective and evaluate a single system on a varied suite of VQA tasks including counting, spatial reasoning, OCR-based reasoning, visual pointing, external knowledge, and more. In this setting, we demonstrate that naively applying the LLM+tools approach using the combined set of all tools leads to poor results. This motivates us to introduce HAMMR: HierArchical MultiModal React. We start from a multimodal ReAct-based system and make it hierarchical by enabling our HAMMR agents to call upon other specialized agents. This enhances the compositionality of the LLM+tools approach, which we show to be critical for obtaining high accuracy on generic VQA. Concretely, on our generic VQA suite, HAMMR outperforms the naive LLM+tools approach by 19.5%. Additionally, HAMMR achieves state-of-the-art results on this task, outperforming the generic standalone PaLI-X VQA model by 5.0%.
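
The idea of agents calling other agents as tools can be illustrated with a minimal sketch. This is not the authors' implementation: the `Agent` class, the rule-based policies standing in for the LLM, and the stub tools `detect_stub` and `ocr_stub` are all hypothetical names introduced here for illustration. The only point it shows is that an agent with the same call signature as a tool can be registered in another agent's tool set, which yields the hierarchy described above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A tool maps a text input to a text observation. Agents share this
# signature, which is what lets an agent be registered as a tool.
Tool = Callable[[str], str]

# Hypothetical stand-in for the LLM: given the question and the observations
# so far, return ("call", tool_name, tool_input) or ("finish", "", answer).
Policy = Callable[[str, List[str]], Tuple[str, str, str]]


@dataclass
class Agent:
    name: str
    policy: Policy
    tools: Dict[str, Tool] = field(default_factory=dict)
    max_steps: int = 5

    def __call__(self, question: str) -> str:
        observations: List[str] = []
        for _ in range(self.max_steps):
            kind, tool_name, payload = self.policy(question, observations)
            if kind == "finish":
                return payload
            # Act, then record the observation for the next decision.
            observations.append(self.tools[tool_name](payload))
        return "unknown"


def detect_stub(query: str) -> str:
    # Assumption: hard-coded toy output instead of a real object detector.
    return "3"


def ocr_stub(query: str) -> str:
    # Assumption: hard-coded toy output instead of a real OCR model.
    return "EXIT"


def single_tool_policy(tool_name: str) -> Policy:
    # Toy specialist policy: call one tool once, then return its output.
    def policy(question: str, observations: List[str]) -> Tuple[str, str, str]:
        if observations:
            return ("finish", "", observations[-1])
        return ("call", tool_name, question)
    return policy


def dispatcher_policy(question: str, observations: List[str]) -> Tuple[str, str, str]:
    # Toy high-level policy: pick a specialist from a keyword, where the
    # real system would let the LLM determine the question type.
    if observations:
        return ("finish", "", observations[-1])
    is_counting = question.lower().startswith("how many")
    return ("call", "counting_agent" if is_counting else "ocr_agent", question)


counting_agent = Agent("counting_agent", single_tool_policy("detect"), {"detect": detect_stub})
ocr_agent = Agent("ocr_agent", single_tool_policy("ocr"), {"ocr": ocr_stub})

# The specialists are passed in exactly like ordinary tools.
dispatcher = Agent(
    "dispatcher",
    dispatcher_policy,
    {"counting_agent": counting_agent, "ocr_agent": ocr_agent},
)

print(dispatcher("How many dogs are in the image?"))  # prints "3"
print(dispatcher("What does the sign say?"))          # prints "EXIT"
```

In this sketch the dispatcher only needs to know the names of its specialists, not their tools or in-context examples, which is the modularity that the flat "all tools in one agent" baseline lacks.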

Paper Structure (13 sections, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Example of HAMMR. We introduce HAMMR: HierArchical MultiModal React for generic visual question answering (VQA). Our work is an evolution of the LLM+tools approach [gupta23cvpr_visprog, hu23neurips_avis, suris23iccv_vipergpt], where we design a single system which can handle a large variety of VQA tasks. HAMMR is a multimodal version of ReAct [yao23iclr_react] where agents themselves can act as tools. This results in a hierarchical and highly compositional approach where high-level HAMMR agents can call lower-level agents dedicated to more specific tasks. This figure shows how HAMMR solves a two-hop Encyclopedic-VQA [mensink23iccv_encvqa] question. Our high-level agent determines the question type and calls the corresponding encyclopedic two-hop agent, which calls the single-hop encyclopedic agent to solve the first part of the composite question.
  • Figure 2: Illustration of our approach. Top: the common approach [hu23neurips_avis, gupta23cvpr_visprog, suris23iccv_vipergpt] is to create a specialist orchestrator agent for each individual task. Bottom left: To create a generalist orchestrator agent, the straightforward approach is to collect all tool descriptions and all in-context examples of each individual specialist. Bottom right: We propose HAMMR: we allow ReAct agents themselves to be called as tools. This leads to a hierarchical and modular approach where high-level ReAct agents can call agents which have a more specific task. For our generic VQA setting, our high-level agent determines the type of VQA question, after which it dispatches the question to the corresponding specialist HAMMR agent.
  • Figure 3: Question type examples. Example images and questions from the six datasets spanning 8 question types used to evaluate HAMMR.
  • Figure 4: Example of specialist HAMMR agents solving PointQA LookTwice. To solve this question, the agent first takes a crop of the image to focus on the object of interest. Interestingly, the LLM orchestrator correctly identifies that "minute hand, big hand" refers to a clock, recovering from an imperfect response of the VQA tool.
  • Figure A.1: Example of HAMMR answering a question that compares two images. The HAMMR QuestionDispatcherAgent first identifies that the question requires verifying and comparing properties of two images, and routes it to the appropriate agent. This agent starts by verifying the first statement (one image contains a sofa and pillows in only neutral brownish shades), which the first image fulfills. It then checks the second property (the other image includes some blue element) on both images. Neither image fulfills it, so the agent determines that the statement cannot be true.
  • ...and 4 more figures