Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz

Abstract

Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs favor efficiency but can miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO, using a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
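To make the composite reward concrete, here is a minimal sketch of one way to combine answer correctness with a crop-cost penalty. The abstract does not give the functional form, so everything here is an assumption for illustration: the function name `composite_reward`, the linear penalty, the fixed per-call cost, and the weights `crop_weight` and `call_cost` are all hypothetical and are not the authors' actual reward.

```python
def composite_reward(
    answer_score: float,
    crop_fraction: float,
    used_tool: bool,
    crop_weight: float = 0.2,   # hypothetical weight on requested high-res area
    call_cost: float = 0.05,    # hypothetical fixed cost of issuing a crop call
) -> float:
    """Illustrative reward: semantic correctness minus a crop-cost penalty.

    answer_score  -- semantic correctness in [0, 1], e.g. from a judge model
    crop_fraction -- fraction of the image area requested at high resolution
    used_tool     -- whether the policy issued a crop tool call at all
    """
    if not 0.0 <= answer_score <= 1.0:
        raise ValueError("answer_score must lie in [0, 1]")
    if not 0.0 <= crop_fraction <= 1.0:
        raise ValueError("crop_fraction must lie in [0, 1]")

    # Assumed linear penalty: no cost when the low-resolution view suffices.
    penalty = call_cost + crop_weight * crop_fraction if used_tool else 0.0
    return answer_score - penalty


if __name__ == "__main__":
    # Correct without cropping beats correct with a crop,
    # which in turn beats a wrong answer that still paid for a crop.
    print(composite_reward(1.0, 0.0, used_tool=False))
    print(composite_reward(1.0, 0.5, used_tool=True))
    print(composite_reward(0.0, 0.5, used_tool=True))
```

Under these assumed weights, a correct answer always outranks an incorrect one, and among correct answers the policy is nudged toward requesting less high-resolution area.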

Paper Structure (52 sections, 9 equations, 16 figures, 11 tables)

Figures (16)

  • Figure 1: AwaRes overview. Left: Given a low-resolution image, AwaRes uses tool-calling to request only the high-resolution crops needed to answer the query. Right: Accuracy vs. retained visual tokens across six benchmarks. AwaRes performs similarly to native high-resolution (80.3%) while using only 36% of the visual tokens.
  • Figure 2: Overview of the automatic supervision pipeline. Each sample is processed at two resolutions; an LLM judge determines resolution sufficiency by comparing predictions to ground truth. Sufficient cases yield single-turn conversations, while insufficient cases are routed to an oracle for crop localization, producing multi-turn trajectories with tool-calling.
  • Figure 2: Agreement on resolution-selection labels. Confusion matrix comparing labels produced by LLaMA-3.3-70B against those from DeepSeek-V3.2 and ANLS. We observe high agreement with DeepSeek-V3.2 and low agreement with the ANLS metric. Values are reported as percentages (%).
  • Figure 3: Crop annotation example. Left: low-resolution input where text is illegible. Middle: oracle-predicted bounding box localizing the answer region. Right: selected high-resolution crop enabling correct response (best viewed when zoomed in).
  • Figure 4: Performance vs. Wall Clock Time. AwaRes achieves sub-second average latency across all benchmarks by encoding resolution decisions in short tool calls, whereas VisionThink's explicit reasoning traces increase decoding time (e.g., 4.3s vs. 0.6s on ChartQA).
  • ...and 11 more figures
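
The supervision pipeline summarized in the captions above (a judge decides whether low resolution suffices; insufficient cases go to an oracle whose bounding box is mapped to a discrete crop) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the crop set `CROPS` (quadrants, halves, full image), the "smallest crop that fully contains the box" rule, and the names `smallest_covering_crop` and `build_trajectory` are all hypothetical. The judge verdict and the oracle box are taken as precomputed inputs.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]

# Hypothetical discrete crop set; the set used in the paper may differ.
CROPS: Dict[str, Box] = {
    "top_left": (0.0, 0.0, 0.5, 0.5),
    "top_right": (0.5, 0.0, 1.0, 0.5),
    "bottom_left": (0.0, 0.5, 0.5, 1.0),
    "bottom_right": (0.5, 0.5, 1.0, 1.0),
    "top": (0.0, 0.0, 1.0, 0.5),
    "bottom": (0.0, 0.5, 1.0, 1.0),
    "left": (0.0, 0.0, 0.5, 1.0),
    "right": (0.5, 0.0, 1.0, 1.0),
    "full": (0.0, 0.0, 1.0, 1.0),
}


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def contains(outer: Box, inner: Box) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def smallest_covering_crop(box: Box) -> str:
    """Map an oracle box to the smallest predefined crop that contains it."""
    if not contains(CROPS["full"], box) or box[0] >= box[2] or box[1] >= box[3]:
        raise ValueError("box must be a non-empty region inside the unit square")
    candidates = [name for name, crop in CROPS.items() if contains(crop, box)]
    return min(candidates, key=lambda name: area(CROPS[name]))


def build_trajectory(low_res_sufficient: bool, oracle_box: Optional[Box]) -> List[str]:
    """Return the turn sequence for one training sample."""
    if low_res_sufficient:
        # Judge says the low-resolution answer is already correct: single turn.
        return ["answer"]
    if oracle_box is None:
        raise ValueError("insufficient samples need an oracle box")
    # Otherwise: request a crop, receive it, then answer.
    return [f"tool_call:{smallest_covering_crop(oracle_box)}", "answer"]


if __name__ == "__main__":
    print(build_trajectory(True, None))
    print(build_trajectory(False, (0.1, 0.1, 0.2, 0.2)))
    print(build_trajectory(False, (0.4, 0.4, 0.6, 0.6)))
```

Choosing the smallest covering crop keeps the retrieved high-resolution area low, at the price of falling back to the full image whenever the evidence straddles both midlines.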