Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search

Yunqi Zhou, Chengjie Jiang, Chun Yuan, Jing Li

TL;DR

This work tackles Ultra-HR RS-VQA by addressing the capacity-resolution mismatch that hinders traditional foundation-model approaches. It introduces ZoomSearch, a training-free framework that explicitly localizes query-relevant regions via Adaptive Multi-Branch Zoom Search and then presents them to a foundation model through Layout-Aware Patch Reassembly, preserving both local structure and global orientation. The approach yields state-of-the-art accuracy on LRS-VQA and MME-RealWorld-RS, while also delivering substantial inference speed gains, and demonstrates robust plug-and-play compatibility across diverse backbones. The findings highlight the practical gains of look-where-it-matters strategies for ultra-high-resolution remote sensing tasks and point to future work in smarter search, verification, and broader backbone integration.

Abstract

With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained, answer-critical details. In this context, guiding the model to look where it matters before prediction becomes crucial. We therefore present ZoomSearch, a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on the Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. It is also more efficient at inference, running 20% to 44% faster than prior search-based methods.
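The hierarchical search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the quadrant split, the fixed `branches` count, the `min_size` stopping rule, and the `score` callable (a stand-in for whatever query-relevance scoring is used) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Box:
    """Axis-aligned patch in pixel coordinates (hypothetical helper type)."""
    x: int
    y: int
    w: int
    h: int


def split_quadrants(box: Box) -> List[Box]:
    """Split a box into four non-overlapping quadrants that cover it exactly."""
    hw, hh = box.w // 2, box.h // 2
    return [
        Box(box.x, box.y, hw, hh),
        Box(box.x + hw, box.y, box.w - hw, hh),
        Box(box.x, box.y + hh, hw, box.h - hh),
        Box(box.x + hw, box.y + hh, box.w - hw, box.h - hh),
    ]


def zoom_search(
    root: Box,
    score: Callable[[Box], float],
    min_size: int = 256,
    branches: int = 2,
    max_depth: int = 4,
) -> List[Box]:
    """Coarse-to-fine search: at each level keep the top-scoring children.

    Assumption: a fixed number of branches per node; the paper's adaptive
    branching rule is not reproduced here.
    """
    frontier, leaves = [root], []
    for _ in range(max_depth):
        next_frontier = []
        for box in frontier:
            # Stop zooming once the children would fall below min_size.
            if min(box.w, box.h) // 2 < min_size:
                leaves.append(box)
            else:
                children = sorted(split_quadrants(box), key=score, reverse=True)
                next_frontier.extend(children[:branches])
        frontier = next_frontier
    return leaves + frontier


def toy_score(box: Box) -> float:
    """Toy relevance: 1.0 if the box contains the pixel (900, 100), else 0.0."""
    inside = box.x <= 900 < box.x + box.w and box.y <= 100 < box.y + box.h
    return 1.0 if inside else 0.0


found = zoom_search(Box(0, 0, 1024, 1024), toy_score,
                    min_size=128, branches=1, max_depth=3)
```

With the toy score, the search narrows a 1024×1024 image down to the single 128×128 patch containing the target pixel after three levels, scoring 12 patches in total instead of all 64 patches of that size.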

Paper Structure

This paper contains 37 sections, 13 equations, 8 figures, 12 tables, 1 algorithm.

Figures (8)

  • Figure 1: (i) Illustration of the proposed ZoomSearch pipeline for Ultra-HR RS-VQA. Given an Ultra-HR remote sensing image, ZoomSearch performs a coarse-to-fine zoom-in search, reassembles the selected patches via layout-aware composition, and feeds the result into a foundation model to obtain the answer. (ii) Overall comparison on LRS-VQA. ZoomSearch+LLaVA-ov surpasses all thirteen other methods, achieving an accuracy that is 12.5% higher than the previous best, GPT-4o, and 26.3% higher than the LLaVA-ov baseline.
  • Figure 2: Pilot-study results on the LRS-VQA subset. (i) Among three search units, the hierarchical search policy achieves the best performance. (ii) Among three reassembly strategies, the relative & global-layout preserving design yields the highest accuracy.
  • Figure 3: Overview of the proposed ZoomSearch pipeline for Ultra-HR RS-VQA. The top-left part illustrates Adaptive Multi-Branch Zoom Search, which progressively explores the image and focuses on regions that are closely related to the text query. The bottom part shows the scoring mechanism, where each candidate patch is evaluated by a patch-text relevance score from an external scoring model and a model-evidence signal from the foundation model. The top-right part depicts Layout-Aware Patch Reassembly, which reorganizes the selected informative patches into a spatially consistent canvas that preserves their relative and global positions.
  • Figure 4: Qualitative comparison between our method and other search-based methods on an object color recognition task.
  • Figure 5: Qualitative comparison between our method and other search-based methods on an object counting task.
  • ...and 3 more figures
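
The Figure 3 caption describes a reassembly step that keeps the relative positions of the selected patches. One simple way to picture this is rank-based compaction, sketched below. This is an illustrative assumption, not the paper's algorithm: the function name `compact_layout` and the `(x, y, w, h)` patch tuples are hypothetical, and only ordering (left/right, above/below) is preserved, not the global placement the caption also mentions.

```python
from typing import Dict, List, Tuple

Patch = Tuple[int, int, int, int]  # (x, y, w, h) in source-image pixels


def compact_layout(patches: List[Patch]) -> Dict[Patch, Tuple[int, int]]:
    """Map each patch to a (row, col) cell on a compact grid.

    Distinct x and y origins are replaced by their ranks, so empty space
    between patches is dropped while left/right and above/below order
    is kept.
    """
    xs = sorted({p[0] for p in patches})
    ys = sorted({p[1] for p in patches})
    col = {x: i for i, x in enumerate(xs)}
    row = {y: i for i, y in enumerate(ys)}
    return {p: (row[p[1]], col[p[0]]) for p in patches}


# Three far-apart patches from a large image collapse onto a 2x2 grid.
selected = [(0, 0, 256, 256), (7680, 0, 256, 256), (7680, 7936, 256, 256)]
cells = compact_layout(selected)
```

Here the top-left patch stays at cell (0, 0), the top-right patch lands at (0, 1), and the bottom-right patch at (1, 1), so the canvas is far smaller than the source image while the spatial ordering survives.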