SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs

Weijia Zhang; Zijia Liu; Haoru Li; Haoqi Chen; Jiaxuan You

TL;DR

SeeingEye introduces a two-agent framework that decouples perception from reasoning to enable multimodal reasoning in text-only LLMs via a Structured Intermediate Representation (SIR). The Translator Agent converts visual input into a rich, query-relevant SIR through adaptive tool use and Visual Chain-of-Thought, while the Reasoning Agent performs high-level cognition on the SIR with multi-round feedback. Experiments on MMMU, MMMU-Pro, OCR-BenchV2, and MIA-Bench show the approach surpasses larger end-to-end VLMs and offers cost efficiency through modular design. The work demonstrates that agentic information flow and structured communication between specialized agents provide a scalable path to leverage strong text-only LLMs for complex multimodal reasoning tasks.

Abstract

Recent advances in text-only large language models (LLMs), such as DeepSeek-R1, demonstrate remarkable reasoning ability. However, these models remain fragile or entirely incapable when extended to multimodal tasks. Existing approaches largely rely on single-form captions, which lack diversity and often fail to adapt across different types of Visual Question Answering (VQA) benchmarks. As a result, they provide no principled or efficient channel for transmitting fine-grained visual information. We introduce SeeingEye, a modular framework that unlocks multimodal reasoning in text-only LLMs through an agent-based small VLM translator. This translator acts as a perception agent: it can invoke specialized tools (e.g., OCR and crop) and iteratively distill multimodal inputs into structured intermediate representations (SIRs) tailored to the question. These SIRs are then passed to the text-only LLM, which serves as a reasoning agent. Crucially, the translator and reasoner engage in multi-round feedback and interaction, enabling the extraction of targeted visual details and yielding more confident answers. Experiments on knowledge-intensive VQA benchmarks, including MMMU and MIA-Bench, demonstrate that SeeingEye not only reduces inference cost but also surpasses much larger end-to-end VLMs. For example, an instantiation combining a 3B-parameter vision translator with an 8B-parameter language reasoner outperforms a monolithic 32B VLM on challenging knowledge-based questions. Our results highlight that decoupling perception from reasoning via agentic information flow offers a scalable and plug-and-play pathway to multimodal reasoning, allowing strong text-only LLMs to fully leverage their reasoning capabilities. Code is available at: https://github.com/ulab-uiuc/SeeingEye
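
The translator-reasoner loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `SIR`, `ReasonerOutput`, `run_loop`, `toy_translate`, and `toy_reason` are hypothetical, the "image" is a plain dictionary, and both agents are stubs standing in for a small VLM with tools and a text-only LLM. It only shows the control flow: the translator builds a text-only SIR, the reasoner either answers or sends feedback, and the feedback steers the next round of perception.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class SIR:
    """Hypothetical structured intermediate representation (text only)."""
    question: str
    observations: List[str] = field(default_factory=list)


@dataclass
class ReasonerOutput:
    answer: Optional[str]    # None means the reasoner is not confident yet
    feedback: Optional[str]  # request for more visual detail, if any


def run_loop(
    image: dict,
    question: str,
    translate: Callable[[dict, SIR, Optional[str]], SIR],
    reason: Callable[[SIR], ReasonerOutput],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Alternate perception and reasoning until an answer or the round limit."""
    sir = SIR(question=question)
    feedback = None
    for round_idx in range(1, max_rounds + 1):
        sir = translate(image, sir, feedback)  # perception agent updates the SIR
        out = reason(sir)                      # reasoning agent sees text only
        if out.answer is not None:
            return out.answer, round_idx
        feedback = out.feedback                # route feedback to the translator
    return None, max_rounds


def toy_translate(image: dict, sir: SIR, feedback: Optional[str]) -> SIR:
    """Stub translator: caption first, then an 'OCR tool' when asked."""
    if feedback is None:
        sir.observations.append("caption: " + image["caption"])
    elif feedback == "need_text":
        sir.observations.append("ocr: " + image["text"])
    return sir


def toy_reason(sir: SIR) -> ReasonerOutput:
    """Stub reasoner: answers only once OCR text is present in the SIR."""
    for obs in sir.observations:
        if obs.startswith("ocr: "):
            return ReasonerOutput(answer=obs[len("ocr: "):], feedback=None)
    return ReasonerOutput(answer=None, feedback="need_text")


if __name__ == "__main__":
    image = {"caption": "a street sign", "text": "STOP"}
    print(run_loop(image, "What does the sign say?", toy_translate, toy_reason))
```

In this toy run the first round yields only a caption, the reasoner asks for text, and the second round adds the OCR observation that allows an answer. The reasoner never touches the image, which is the decoupling the abstract emphasizes.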
