Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

Nimrod Shabtay; Moshe Kimhi; Artem Spector; Sivan Haray; Ehud Rivlin; Chaim Baskin; Raja Giryes; Eli Schwartz

Abstract

Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational cost, while low-resolution inputs are efficient but may miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
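The supervision and reward logic described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the quadrant-based `CROP_SET`, the function names, the rule that cropping is needed only when the low-resolution answer is wrong and the high-resolution answer is right, and the area-weighted penalty with coefficient `crop_cost` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Box as (x0, y0, x1, y1), normalized to [0, 1].
Box = Tuple[float, float, float, float]

# Hypothetical discrete crop set: four quadrants plus the full image.
CROP_SET: Dict[str, Box] = {
    "top_left": (0.0, 0.0, 0.5, 0.5),
    "top_right": (0.5, 0.0, 1.0, 0.5),
    "bottom_left": (0.0, 0.5, 0.5, 1.0),
    "bottom_right": (0.5, 0.5, 1.0, 1.0),
    "full": (0.0, 0.0, 1.0, 1.0),
}


def area(box: Box) -> float:
    """Area of a normalized box (0 for degenerate boxes)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def covers(crop: Box, box: Box) -> bool:
    """True if `crop` fully contains `box`."""
    return (crop[0] <= box[0] and crop[1] <= box[1]
            and crop[2] >= box[2] and crop[3] >= box[3])


def needs_crop(low_res_correct: bool, high_res_correct: bool) -> bool:
    """Assumed labeling rule: crop only when high resolution fixes the answer."""
    return (not low_res_correct) and high_res_correct


def map_box_to_crop(box: Box) -> str:
    """Map an oracle bounding box to the smallest crop that contains it."""
    candidates = [name for name, crop in CROP_SET.items() if covers(crop, box)]
    return min(candidates, key=lambda name: area(CROP_SET[name]))


def composite_reward(answer_correct: bool, crops: List[str],
                     crop_cost: float = 0.1) -> float:
    """Assumed reward: correctness minus a penalty on requested image area."""
    requested_area = sum(area(CROP_SET[name]) for name in crops)
    return float(answer_correct) - crop_cost * requested_area
```

Under these assumptions, a sample whose evidence box lies inside one quadrant yields a trajectory that requests only that quadrant, and a correct answer obtained without any crop receives the full reward, which is the behavior a crop-cost penalty is meant to encourage.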

Paper Structure (52 sections, 9 equations, 16 figures, 11 tables)

Introduction
Related Work
Method
Problem setup
Data curation: automatic supervision for crop requests
Stage 1: resolution-sufficiency labeling (when to crop).
Stage 2: crop target construction (where to crop).
Stage 3: supervised tool-use trajectories.
Cold-start supervised reference policy (SFT)
Multi-turn GRPO
Rollouts and trajectories.
Reward design.
GRPO optimization.
Inference
Experimental Results
...and 37 more sections

Figures (16)

Figure 1: AwaRes overview. Left: Given a low-resolution image, AwaRes uses tool-calling to request only the high-resolution crops needed to answer the query. Right: Accuracy vs. retained visual tokens across six benchmarks. AwaRes performs similarly to native high-resolution (80.3%) while using only 36% of the visual tokens.
Figure 2: Overview of the automatic supervision pipeline. Each sample is processed at two resolutions; an LLM judge determines resolution sufficiency by comparing predictions to ground truth. Sufficient cases yield single-turn conversations, while insufficient cases are routed to an oracle for crop localization, producing multi-turn trajectories with tool-calling.
Figure 2: Agreement on resolution-selection labels. Confusion matrix comparing labels produced by LLaMA-3.3-70B against those from DeepSeek-V3.2 and ANLS. We observe high agreement with DeepSeek-V3.2 and low agreement with the ANLS metric. Values are reported as percentages (%).
Figure 3: Crop annotation example. Left: low-resolution input where text is illegible. Middle: oracle-predicted bounding box localizing the answer region. Right: selected high-resolution crop enabling correct response (best viewed when zoomed in).
Figure 4: Performance vs. Wall Clock Time. AwaRes achieves sub-second average latency across all benchmarks by encoding resolution decisions in short tool calls, whereas VisionThink's explicit reasoning traces increase decoding time (e.g., 4.3s vs. 0.6s on ChartQA).
...and 11 more figures
