When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li

TL;DR

This work tackles the challenge of enabling efficient, accurate vision-language understanding on gigapixel remote sensing images with large vision-language models. It introduces a coarse-to-fine, text-guided token pruning framework that combines a Region Focus Module (RFM) with a Dynamic Image Pyramid (DIP) to selectively process text-relevant tiles and tokens, reducing computation while preserving detail. To evaluate progress, the authors present LRS-VQA, a large RSI visual-question-answering benchmark with up to 27,328-pixel images and diverse question types. Empirical results show improved accuracy and substantial efficiency gains over existing high-resolution methods, validating the effectiveness of language-guided localization for RSIs. The approach is architecture-agnostic and offers a practical pathway for scalable RS-VLM deployment in real-world analysis tasks.

Abstract

Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images, leading to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational costs. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large imagery. Additionally, existing benchmarks for evaluating LVLMs' perception ability on large RSIs suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image length up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. The dataset and code are available at https://github.com/VisionXLab/LRS-VQA.
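
To make the cost argument and the coarse-to-fine idea concrete, here is a minimal Python sketch. It is not the authors' implementation. The tile size (336 px), the tokens per tile (576), the square image, the 2x refinement between pyramid levels, and the function names `grid_token_count` and `coarse_to_fine_select` are all assumptions made for illustration. The per-tile relevance scores stand in for whatever a text-aware module such as the RFM would produce.

```python
import math
import numpy as np

def grid_token_count(side_px, tile_px=336, tokens_per_tile=576):
    """Vision tokens needed if every tile of a square image is encoded.

    tile_px and tokens_per_tile are assumed values, not taken from the paper.
    """
    tiles_per_side = math.ceil(side_px / tile_px)
    return tiles_per_side ** 2 * tokens_per_tile

def coarse_to_fine_select(relevance_pyramid, keep_ratio=0.25):
    """Select tiles level by level, from coarse to fine.

    relevance_pyramid: list of 2D arrays of text-relevance scores, ordered
    coarse to fine; each level is assumed to have twice the rows and columns
    of the previous one. Returns a boolean mask over the finest level.
    """
    mask = np.ones(relevance_pyramid[0].shape, dtype=bool)
    for level, scores in enumerate(relevance_pyramid):
        if level > 0:
            # Each tile kept at the coarser level expands into 2x2 children.
            mask = mask.repeat(2, axis=0).repeat(2, axis=1)
        # Keep only the top fraction of the tiles that are still candidates.
        k = max(1, int(round(keep_ratio * mask.sum())))
        threshold = np.sort(scores[mask])[-k]
        # Ties at the threshold are all kept, so slightly more than k may survive.
        mask = mask & (scores >= threshold)
    return mask
```

Under these assumed settings, exhaustively tiling a 27,328-pixel square image would take `grid_token_count(27328)` = 3,873,024 vision tokens, which shows why unlimited grids are costly. The coarse-to-fine loop only scores the children of tiles that survived the previous level.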

Paper Structure

This paper contains 27 sections, 7 equations, 6 figures, 18 tables, 1 algorithm.

Figures (6)

  • Figure 1: High-resolution strategy comparison for modular LVLMs. (a) and (b) show that existing grid-based cropping methods face challenges when processing large RSIs. (c) The proposed dynamic pyramid-based token pruning strategy can dynamically select image tiles of key regions related to the input text, balancing image detail and computational cost.
  • Figure 2: The pipeline of the proposed method. The entire process iterates in a coarse-to-fine manner: at each iteration, based on the output of the RFM, it either retrieves high-resolution features from the next DIP level (leftward orange arrow) or performs token pruning at the current level (rightward orange arrow). During training, the RFM distills text-related attention from the LLM; during inference, the RFM generates the attention scores for the input vision tokens. GSD denotes ground sample distance.
  • Figure 3: The proposed RFM and attention distillation strategy. The left part illustrates the core idea: distilling accurate text-related key-region localization ability from the LLM part of the LVLM. The right part shows the distillation details. Only specific layer pairs are selected for distillation, to avoid hidden-state discontinuities. "sys token" denotes the tokens from the system prompt.
  • Figure 4: The construction pipeline of the proposed LRS-VQA dataset. The visual prompt (red box) is inspired by SoM [yang2023setofmark].
  • Figure 5: The accuracy trends of Qwen2-VL across varying maximum input pixels. Accuracy on both the manually annotated MME-RealWorld-RS and the proposed LRS-VQA correlates positively with resolution, supporting the effectiveness of LRS-VQA in evaluating LVLMs' high-resolution RSI perception capabilities.
  • ...and 1 more figures
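
The captions of Figures 2 and 3 describe two steps that can be sketched in a few lines: distilling the LLM's text-related attention into the RFM during training, and using the RFM's scores to prune vision tokens at inference. The sketch below is illustrative only. The KL-divergence form of the loss, the top-k pruning rule, and the names `attention_distill_loss` and `prune_tokens` are assumptions; the captions do not specify the actual loss or the pruning criterion.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_distill_loss(student_logits, teacher_attn, eps=1e-8):
    """Assumed loss: KL(teacher || student) over vision tokens.

    student_logits: (n_pairs, n_tokens) raw scores from the student layers.
    teacher_attn:   (n_pairs, n_tokens) non-negative text-to-vision attention
                    taken from the paired teacher (LLM) layers.
    The result is averaged over the selected layer pairs.
    """
    teacher = teacher_attn / (teacher_attn.sum(axis=-1, keepdims=True) + eps)
    student = softmax(student_logits, axis=-1)
    kl = (teacher * (np.log(teacher + eps) - np.log(student + eps))).sum(axis=-1)
    return float(kl.mean())

def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    idx = np.sort(np.argsort(scores)[-keep:])
    return tokens[idx], idx
```

In this sketch the loss is near zero when the student reproduces the teacher's attention distribution and grows as the two diverge. Sorting the kept indices keeps the surviving tokens in their original spatial order.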