Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Yunze Man, De-An Huang, Guilin Liu, Shiwei Sheng, Shilong Liu, Liang-Yan Gui, Jan Kautz, Yu-Xiong Wang, Zhiding Yu

TL;DR

Argus targets the limitations of vision-centric reasoning in multimodal large language models by introducing a grounding-driven visual attention mechanism that uses object-centric RoI grounding as visual chain-of-thought signals. It integrates a mixture-of-vision-experts encoder suite with an LLM decoder and employs explicit RoI search and two modes of visual context re-engagement (re-encoding and re-sampling) guided by language prompts. Across diverse benchmarks, Argus achieves state-of-the-art performance among public models of similar scale and demonstrates strong object grounding alongside multimodal reasoning, highlighting the value of explicit visual CoT signals. The work underlines a shift toward vision-centric multimodal intelligence and suggests future directions in scaling visual CoT data and exploring open-world perception tasks.

Abstract

Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective. Project page: https://yunzeman.github.io/argus/

Paper Structure

This paper contains 24 sections, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Visual question answering, grounding, and chain-of-thought reasoning with Argus. "ctx-token" is short for context token.
  • Figure 2: Illustration of two visual attention mechanisms. Involuntary Attention (Left): stimulus-driven; unconditioned feature extraction; salient objects. Direct Attention (Right): goal-driven; language-guided region-of-interest (RoI) feature extraction.
  • Figure 3: Illustration of the Argus architecture. In addition to standard unconditioned visual tokenization, our method incorporates a goal-directed visual tokenization procedure. The model can ground the most relevant region-of-interest (RoI) conditioned on the multimodal input instructions. The visual RoI is then sampled from the input image and fed to the RoI re-engagement module to extract another set of visual tokens as CoT context for reasoning.
  • Figure 4: Illustration of two visual CoT mechanisms. Re-encoding expands the RoI and treats it as a new image for tokenization. Re-sampling retrieves knowledge from the pre-extracted token cache.
  • Figure 5: Qualitative evaluation of Argus. We achieve superior performance in challenging multimodal reasoning and perception tasks.
  • ...and 3 more figures
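
The two RoI re-engagement modes summarized in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`toy_encode`, `expand_box`, `re_encode`, `re_sample`), the `(x0, y0, x1, y1)` pixel box format, the expansion factor of 1.5, and the patch-mean "encoder" are all hypothetical stand-ins chosen only to make the contrast between the two modes concrete.

```python
import numpy as np

def toy_encode(image, patch=4):
    """Hypothetical stand-in for a vision encoder: one token per patch (patch mean)."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    trimmed = image[:gh * patch, :gw * patch]
    patches = trimmed.reshape(gh, patch, gw, patch, c)
    return patches.mean(axis=(1, 3))  # token grid of shape (gh, gw, c)

def expand_box(box, scale, width, height):
    """Grow an (x0, y0, x1, y1) box about its center and clip it to the image."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w, half_h = (x1 - x0) * scale / 2, (y1 - y0) * scale / 2
    return (max(0, int(np.floor(cx - half_w))),
            max(0, int(np.floor(cy - half_h))),
            min(width, int(np.ceil(cx + half_w))),
            min(height, int(np.ceil(cy + half_h))))

def re_encode(image, box, encode, scale=1.5):
    """Re-encoding: crop the expanded RoI and tokenize it as if it were a new image."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = expand_box(box, scale, w, h)
    tokens = encode(image[y0:y1, x0:x1])
    return tokens.reshape(-1, tokens.shape[-1])

def re_sample(token_cache, box, image_size):
    """Re-sampling: reuse cached tokens whose patches overlap the RoI (no new encoding)."""
    grid_h, grid_w, dim = token_cache.shape
    height, width = image_size
    x0, y0, x1, y1 = box
    c0 = int(np.floor(x0 / width * grid_w))
    c1 = int(np.ceil(x1 / width * grid_w))
    r0 = int(np.floor(y0 / height * grid_h))
    r1 = int(np.ceil(y1 / height * grid_h))
    return token_cache[r0:r1, c0:c1].reshape(-1, dim)

# Toy usage: a 16x16 RGB image and an RoI box assumed to come from the grounding step.
image = np.arange(16 * 16 * 3, dtype=float).reshape(16, 16, 3)
cache = toy_encode(image)          # tokens from the initial, unconditioned pass
roi = (4, 4, 12, 12)

fresh_tokens = re_encode(image, roi, toy_encode)
cached_tokens = re_sample(cache, roi, image.shape[:2])
print(fresh_tokens.shape, cached_tokens.shape)
```

In this sketch, re-encoding pays for a second encoder pass over the crop, while re-sampling only indexes into tokens that were already computed. How Argus actually selects, expands, and resamples the RoI is specified in the paper itself rather than here.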