InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

Kaican Li, Lewei Yao, Jiannan Wu, Tiezheng Yu, Jierun Chen, Haoli Bai, Lu Hou, Lanqing Hong, Wei Zhang, Nevin L. Zhang

TL;DR

This work targets the core challenge of thinking with images by introducing O3-Bench, a hard benchmark that requires multi-hop visual reasoning over high-information-density charts and maps. It proposes InSight-o3, a two-agent framework that separates high-level reasoning (vReasoner) from focused visual search (vSearcher), with a specialized InSight-o3-vS model trained via a hybrid reinforcement learning regime. The results show that plugging in vSearcher yields substantial improvements for frontier models across multiple benchmarks, including strong gains on O3-Bench and generalization across different vReasoners. This work demonstrates a concrete, plug-and-play approach toward open, reasoning-capable multimodal systems and provides rich annotated data and code for the community to build on.

Abstract

The ability of AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which obtains only 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher), for which we introduce the task of generalized visual search -- locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3.
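
The division of labor between vReasoner and vSearcher described in the abstract can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the names `ReasonerStep`, `run_episode`, `toy_reasoner`, and `toy_searcher` are hypothetical, the call budget `max_calls` is arbitrary, and both agents are plain Python stubs standing in for multimodal models.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
Found = List[Tuple[str, Box]]    # (region description, located box) pairs


@dataclass
class ReasonerStep:
    """One hypothetical reasoner turn: ask for a region or give an answer."""
    search_request: Optional[str] = None  # free-form region description
    answer: Optional[str] = None


def run_episode(
    reasoner: Callable[[str, Found], ReasonerStep],
    searcher: Callable[[str], Box],
    query: str,
    max_calls: int = 4,  # assumed budget, not taken from the paper
) -> Tuple[Optional[str], Found]:
    """Alternate between reasoning and visual search until an answer appears."""
    found: Found = []
    for _ in range(max_calls):
        step = reasoner(query, found)
        if step.answer is not None:
            return step.answer, found
        if step.search_request is None:
            break  # the reasoner neither answered nor asked for a region
        # The searcher maps a free-form description to an image region.
        found.append((step.search_request, searcher(step.search_request)))
    return None, found


def toy_searcher(description: str) -> Box:
    """Stub searcher: looks the description up in a fixed table."""
    regions = {"legend of the bar chart": (10, 10, 120, 60)}
    return regions.get(description, (0, 0, 0, 0))


def toy_reasoner(query: str, found: Found) -> ReasonerStep:
    """Stub reasoner: requests one region, then answers."""
    if not found:
        return ReasonerStep(search_request="legend of the bar chart")
    return ReasonerStep(answer=f"answered using {len(found)} region(s)")


answer, found = run_episode(toy_reasoner, toy_searcher, "Which series is blue?")
```

Because the searcher is passed in as a plain callable, the same loop works with any reasoner, which loosely mirrors the plug-and-play property the abstract claims for vSearcher.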

Paper Structure

This paper contains 65 sections, 2 equations, 21 figures, 10 tables, 1 algorithm.

Figures (21)

  • Figure 1: A multi-step visual reasoning example of InSight-o3 on O3-Bench. For clarity, the internal reasoning processes are omitted. More examples can be found in the appendix.
  • Figure 2: Training pipeline. We use a hybrid RL algorithm to train vSearcher. (a) In the in-loop component, vReasoner generates visual search tasks on-the-fly during training as it tries to answer a user query. We use vReasoner's feedback and final answer correctness as supervision (denoted by dashed arrows) for vSearcher. (b) In the out-of-loop component, we use pre-generated descriptions with ground-truth bounding boxes, allowing us to train vSearcher efficiently via IoU supervision.
  • Figure 3: Training dynamics of InSight-o3. The rightmost chart, "# of vReasoner calls", shows the average number of times vReasoner calls vSearcher per QA. $^{\ast}$ For a fair comparison, the reward curves of all settings are plotted under the same setting ("w/o feedback").
  • Figure 4: Distribution of layout numbers in O3-Bench.
  • Figure 5: Resolution distribution.
  • ...and 16 more figures
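
The IoU supervision mentioned in the Figure 2 caption compares a predicted bounding box with a ground-truth box. The sketch below computes the standard intersection-over-union for axis-aligned boxes; it assumes an `(x_min, y_min, x_max, y_max)` box format and a hypothetical function name `box_iou`, and it makes no claim about how the paper turns IoU into a training reward.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def box_iou(pred: Box, target: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes, in [0, 1]."""
    # Corners of the overlapping rectangle (empty if the boxes are disjoint).
    ix_min = max(pred[0], target[0])
    iy_min = max(pred[1], target[1])
    ix_max = min(pred[2], target[2])
    iy_max = min(pred[3], target[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    union = box_area(pred) + box_area(target) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
example = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A score of 1.0 means the predicted region matches the ground truth exactly, and 0.0 means the boxes do not overlap.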