Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search
Yunqi Zhou, Chengjie Jiang, Chun Yuan, Jing Li
TL;DR
This work tackles Ultra-HR RS-VQA by addressing the capacity-resolution mismatch that hinders traditional foundation-model approaches. It introduces ZoomSearch, a training-free framework that explicitly localizes query-relevant regions via Adaptive Multi-Branch Zoom Search and then presents them to a foundation model through Layout-Aware Patch Reassembly, preserving both local structure and global orientation. The approach yields state-of-the-art accuracy on LRS-VQA and MME-RealWorld-RS, while also delivering substantial inference speed gains, and demonstrates robust plug-and-play compatibility across diverse backbones. The findings highlight the practical gains of look-where-it-matters strategies for ultra-high-resolution remote sensing tasks and point to future work in smarter search, verification, and broader backbone integration.
Abstract
With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained, answer-critical details. In this context, guiding the model to look where it matters before prediction becomes crucial. We therefore present ZoomSearch, a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on the Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. It is also substantially more efficient at inference, running 20% to 44% faster than prior search-based methods.
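To make the two stages named above more concrete, the following is a minimal, hypothetical sketch and not the authors' implementation. It assumes a fixed 2x2 split per level, a fixed number of retained branches (the adaptive branching described in the abstract is not modeled), and a caller-supplied `score` function standing in for whatever query-relevance signal the real system uses. All names (`split`, `zoom_search`, `compact_layout`) are illustrative.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


def split(box: Box, grid: int = 2) -> List[Box]:
    """Split a box into grid x grid equally sized children (row-major order)."""
    x0, y0, x1, y1 = box
    w, h = (x1 - x0) // grid, (y1 - y0) // grid
    return [
        (x0 + c * w, y0 + r * h, x0 + (c + 1) * w, y0 + (r + 1) * h)
        for r in range(grid)
        for c in range(grid)
    ]


def zoom_search(root: Box, score: Callable[[Box], float],
                depth: int = 2, branches: int = 2) -> List[Box]:
    """Hierarchical search: at each level keep the best-scoring children, then zoom in.

    Assumption: `score` is a placeholder for a query-relevance measure;
    `branches` is fixed here rather than chosen adaptively.
    """
    frontier = [root]
    for _ in range(depth):
        children = [child for box in frontier for child in split(box)]
        children.sort(key=score, reverse=True)
        frontier = children[:branches]
    return frontier


def compact_layout(boxes: List[Box]) -> Dict[Box, Tuple[int, int]]:
    """Assign each selected box a (row, col) slot on a compact canvas.

    Empty space between patches is dropped, but left/right and up/down
    ordering of the patches is preserved.
    """
    xs = sorted({b[0] for b in boxes})
    ys = sorted({b[1] for b in boxes})
    return {b: (ys.index(b[1]), xs.index(b[0])) for b in boxes}


if __name__ == "__main__":
    # Toy relevance: patches whose center is closer to a target point score higher.
    target = (900, 100)

    def toy_score(b: Box) -> float:
        cx, cy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
        return -((cx - target[0]) ** 2 + (cy - target[1]) ** 2)

    selected = zoom_search((0, 0, 1024, 1024), toy_score)
    print(selected)
    print(compact_layout(selected))
```

In this sketch the search touches only a handful of patches per level instead of encoding the full image, which is the intuition behind decoupling 'where to look' from 'how to answer'; the selected patches and their grid slots would then be composed into a single image for the foundation model.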
