Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang

TL;DR

This work introduces Region-to-Image Distillation (R2I), a training-time approach that internalizes the benefits of test-time zooming in multimodal LLMs to enable single-pass, fine-grained perception. By generating high-quality region-grounded VQA data from micro-crops using strong teacher models and distilling it back to full images with explicit grounding, the method teaches models to extract minute evidence without iterative cropping during inference. The authors also present ZoomBench, a hybrid-annotated benchmark with a dual-view protocol that quantifies the global-regional zooming gap, plus a relative-attention analysis for interpretability. Experiments show that the resulting ZwZ (Zooming without Zooming) models achieve leading performance on fine-grained perception benchmarks and improve general multimodal cognition while dramatically reducing inference latency compared to Thinking-with-Images baselines. The work also discusses when tool-based thinking is necessary and when its gains can be distilled, offering a practical path toward faster, more reliable multimodal perception.

Abstract

Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on tasks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
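
As a rough illustration of the dual-view protocol, the sketch below scores each question twice, once on the full image and once on the cropped region, and reports the difference. The abstract does not give the exact formula, so defining the "zooming gap" as region-view accuracy minus full-view accuracy is an assumption here, and the names `DualViewResult` and `zooming_gap` are hypothetical rather than taken from the released code.

```python
from dataclasses import dataclass


@dataclass
class DualViewResult:
    """Outcome of one question evaluated under both views (hypothetical record)."""
    correct_full: bool    # model answered correctly when given the full image
    correct_region: bool  # model answered correctly when given the cropped region


def zooming_gap(results):
    """Return (full-view accuracy, region-view accuracy, gap) as fractions.

    ASSUMPTION: the gap is region-view accuracy minus full-view accuracy;
    the paper's exact definition may differ.
    """
    if not results:
        raise ValueError("need at least one result")
    n = len(results)
    acc_full = sum(r.correct_full for r in results) / n
    acc_region = sum(r.correct_region for r in results) / n
    return acc_full, acc_region, acc_region - acc_full


if __name__ == "__main__":
    demo = [
        DualViewResult(correct_full=True, correct_region=True),
        DualViewResult(correct_full=False, correct_region=True),
        DualViewResult(correct_full=False, correct_region=False),
        DualViewResult(correct_full=True, correct_region=True),
    ]
    print(zooming_gap(demo))
```

Under this reading, a gap near zero would suggest the model already extracts from the full image what it could see in the crop, while a large positive gap would indicate evidence that is only recovered after zooming.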

Paper Structure

This paper contains 40 sections, 12 equations, 9 figures, 9 tables, 2 algorithms.

Figures (9)

  • Figure 1: Average scores across multimodal perception benchmarks. ZwZ-4B/7B/8B demonstrate competitive performance compared with current SOTA MLLMs (e.g., Gemini-3-Flash, Kimi-K2.5, Qwen3-VL-235B).
  • Figure 2: Zooming without Zooming. "Thinking-with-Images" models rely on iterative tool-based cropping and re-encoding at inference, incurring high latency. Our Region-to-Image Distillation performs zooming only during training to synthesize region-grounded supervision on the full image, enabling single-pass fine-grained perception without test-time tool use.
  • Figure 3: Overview of Region-to-Image Distillation. We synthesize fine-grained VQA pairs on zoomed-in micro-crops using strong teachers with consensus filtering, then distill them to the full image via box-overlay grounding and an augmented prompt, enabling improved single-pass inference without test-time zooming.
  • Figure 4: Category distribution across six fine-grained dimensions of our benchmark (left) and ZoomBench data statistics: distribution of image resolutions (middle) and crop-to-image area ratios (right).
  • Figure 4: We compare our models (single forward pass) with agentic models on several perception benchmarks. The best results are highlighted in bold, and the second-best are underlined.
  • ...and 4 more figures
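
The data-synthesis steps named in the Figure 3 caption (teacher answers on a micro-crop, consensus filtering, then a full-image sample with a box overlay and an augmented prompt) might be sketched as follows. This is a minimal sketch, not the authors' implementation: the majority-vote rule, the `min_agree` threshold, the prompt wording, and every name (`Box`, `consensus_answer`, `build_full_image_sample`) are assumptions made for illustration, and teacher outputs are passed in as plain strings rather than produced by real model calls.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Crop location in full-image pixel coordinates (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int


def consensus_answer(teacher_answers, min_agree=2):
    """Return the most common normalized answer if enough teachers agree, else None.

    ASSUMPTION: consensus filtering is modeled as a simple vote threshold.
    """
    normalized = [a.strip().lower() for a in teacher_answers]
    if not normalized:
        return None
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer if votes >= min_agree else None


def build_full_image_sample(image_id, crop_box, question, teacher_answers, min_agree=2):
    """Turn a question asked on a crop into a training sample on the full image.

    Returns None when the teachers disagree, so the sample is filtered out.
    """
    answer = consensus_answer(teacher_answers, min_agree)
    if answer is None:
        return None
    # ASSUMPTION: the augmented prompt simply points the student at the boxed region.
    prompt = (
        f"Focus on the region inside the box "
        f"({crop_box.x0}, {crop_box.y0}, {crop_box.x1}, {crop_box.y1}). {question}"
    )
    return {
        "image": image_id,         # the full image, not the crop
        "overlay_box": crop_box,   # box to draw on the full image for grounding
        "prompt": prompt,
        "answer": answer,
    }


if __name__ == "__main__":
    sample = build_full_image_sample(
        "street_0001.jpg",
        Box(1200, 340, 1290, 400),
        "What number is printed on the sign?",
        ["Exit 12", "exit 12 ", "Exit 21"],
    )
    print(sample)
```

The key design point this sketch tries to capture is that the crop is used only to obtain a reliable answer; the stored sample references the full image, so the student never needs the crop at inference time.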