Table of Contents
Fetching ...

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin

TL;DR

This work introduces Zoom Eye, a training-free, model-agnostic tree-search framework that enables multimodal LLMs to perform vision-level reasoning on high-resolution images by progressively zooming into informative regions. By representing an image as a hierarchical patch tree and using a prompts-driven confidence and a stopping criterion, Zoom Eye guides flexible, human-like zooming behavior to collect task-relevant visual cues. The system is implemented with two image input paradigms (Local and Global+Local) and is demonstrated to significantly improve performance across multiple MLLMs on high-resolution benchmarks and real-world tasks, while also revealing a phenomenon analogous to test-time scaling in text-based reasoning. The work also analyzes limitations and provides a thorough comparison to existing high-resolution processing approaches, highlighting Zoom Eye’s training-free advantage and practical impact for robust visual reasoning in real-world applications.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in vision-language understanding. Recently, with the integration of test-time scaling techniques, these models have also shown strong potential in visual reasoning. However, most existing reasoning approaches remain text-level in nature: MLLMs are prompted to explore various combinations of textual tokens via their underlying language model, while the visual input remains fixed throughout the reasoning process. This paradigm limits the model's ability to fully exploit rich visual information, particularly when dealing with images containing numerous fine-grained elements. In such cases, vision-level reasoning becomes crucial - where models dynamically zoom into specific regions of the image to gather detailed visual cues necessary for accurate decision-making. In this paper, we propose Zoom Eye, a training-free, model-agnostic tree search algorithm tailored for vision-level reasoning. Zoom Eye treats an image as a hierarchical tree structure, where each child node represents a zoomed-in sub-region of its parent, and the root corresponds to the full image. The algorithm enables MLLMs to simulate human-like zooming behavior by navigating from root to leaf nodes in search of task-relevant visual evidence. We experiment on a series of high-resolution benchmarks and the results demonstrate that Zoom Eye consistently improves the performance of multiple MLLMs by a large margin (e.g., InternVL2.5-8B increases by 15.71% and 17.69% on HR-Bench) and also enables small 3-8B MLLMs to outperform strong large models such as GPT-4o. Code: https://github.com/om-ai-lab/ZoomEye

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

TL;DR

This work introduces Zoom Eye, a training-free, model-agnostic tree-search framework that enables multimodal LLMs to perform vision-level reasoning on high-resolution images by progressively zooming into informative regions. By representing an image as a hierarchical patch tree and using a prompts-driven confidence and a stopping criterion, Zoom Eye guides flexible, human-like zooming behavior to collect task-relevant visual cues. The system is implemented with two image input paradigms (Local and Global+Local) and is demonstrated to significantly improve performance across multiple MLLMs on high-resolution benchmarks and real-world tasks, while also revealing a phenomenon analogous to test-time scaling in text-based reasoning. The work also analyzes limitations and provides a thorough comparison to existing high-resolution processing approaches, highlighting Zoom Eye’s training-free advantage and practical impact for robust visual reasoning in real-world applications.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in vision-language understanding. Recently, with the integration of test-time scaling techniques, these models have also shown strong potential in visual reasoning. However, most existing reasoning approaches remain text-level in nature: MLLMs are prompted to explore various combinations of textual tokens via their underlying language model, while the visual input remains fixed throughout the reasoning process. This paradigm limits the model's ability to fully exploit rich visual information, particularly when dealing with images containing numerous fine-grained elements. In such cases, vision-level reasoning becomes crucial - where models dynamically zoom into specific regions of the image to gather detailed visual cues necessary for accurate decision-making. In this paper, we propose Zoom Eye, a training-free, model-agnostic tree search algorithm tailored for vision-level reasoning. Zoom Eye treats an image as a hierarchical tree structure, where each child node represents a zoomed-in sub-region of its parent, and the root corresponds to the full image. The algorithm enables MLLMs to simulate human-like zooming behavior by navigating from root to leaf nodes in search of task-relevant visual evidence. We experiment on a series of high-resolution benchmarks and the results demonstrate that Zoom Eye consistently improves the performance of multiple MLLMs by a large margin (e.g., InternVL2.5-8B increases by 15.71% and 17.69% on HR-Bench) and also enables small 3-8B MLLMs to outperform strong large models such as GPT-4o. Code: https://github.com/om-ai-lab/ZoomEye

Paper Structure

This paper contains 35 sections, 1 equation, 6 figures, 9 tables, 4 algorithms.

Figures (6)

  • Figure 1: Top: When dealing with a high-resolution image, MLLMs effectively perceive the dominant objects but often fail to recognize finer details, highlighting the need for vision-level reasoning. Bottom: Applied with Zoom Eye, MLLMs could perform vision-level reasoning, allowed to explore the image details until they can answer the question.
  • Figure 2: Zoom Eye enables MLLMs to (a) answer the question directly when the visual information is adequate, (b) zoom in gradually for a closer examination, and (c) zoom out to the previous view and explore other regions if the desired information is not initially found.
  • Figure 3: Two image input methods for MLLMs with distinct image processing.
  • Figure 4: The relationship between the number of search steps and the performance of the MLLM. The experimental statistics are derived from LLaVA-ov-7B’s results on $V^*$ Bench.
  • Figure 5: Examples of Zoom Eye. The resolution of the image is displayed. Red rectangles are patches searched by Zoom Eye.
  • ...and 1 more figures