GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing

Ruizhe Ou, Yuan Hu, Fan Zhang, Jiaxin Chen, Yu Liu

TL;DR

GeoPix addresses the gap in pixel-level dialogue for remote sensing by introducing a mask predictor and a class-wise memory module that stores shared geo-context across scales. The approach couples a pre-trained vision encoder and an LLM with a multi-scale mask predictor that is conditioned on segmentation tokens and enhanced by a memory bank that retrieves class- and scale-aware context. To train effectively without large pixel-level RS datasets, the authors build GeoPixInstruct and implement a two-stage training strategy to balance text generation and mask prediction, achieving state-of-the-art results in multi-referring segmentation while maintaining competitive image- and region-level performance. The work advances pixel-level RS dialogue, offering a practical path toward fine-grained, instruction-following remote sensing interpretation and potentially enabling temporal and complex scene understanding in applied settings.

Abstract

Multi-modal large language models (MLLMs) have achieved remarkable success in image- and region-level remote sensing (RS) image understanding tasks, such as image captioning, visual question answering, and visual grounding. However, existing RS MLLMs lack pixel-level dialogue capability, which involves responding to user instructions with segmentation masks for specific instances. In this paper, we propose GeoPix, an RS MLLM that extends image understanding capabilities to the pixel level. This is achieved by equipping the MLLM with a mask predictor, which transforms visual features from the vision encoder into masks conditioned on the LLM's segmentation token embeddings. To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor to capture and store class-wise geo-context at the instance level across the entire dataset. In addition, to address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset, comprising 65,463 images and 140,412 instances, with each instance annotated with text descriptions, bounding boxes, and masks. Furthermore, we develop a two-stage training strategy to balance the distinct requirements of text generation and mask prediction in multi-modal multi-task optimization. Extensive experiments verify the effectiveness and superiority of GeoPix in pixel-level segmentation tasks and show that it maintains competitive performance on image- and region-level benchmarks.
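
To make the idea of "masks conditioned on the LLM's segmentation token embeddings" concrete, the following is a minimal, dependency-free sketch. It assumes the simplest possible conditioning: a dot product between each pixel's feature vector and one segmentation token embedding, followed by a sigmoid and a threshold. The function names, the threshold, and the toy shapes are hypothetical, and the actual GeoPix mask predictor is multi-scale and memory-enhanced, so this illustrates only the conditioning step, not the authors' implementation.

```python
import math


def predict_mask(pixel_features, seg_embedding, threshold=0.5):
    """Return a binary H x W mask for one segmentation token.

    pixel_features: H x W x D nested lists of per-pixel features.
    seg_embedding:  length-D list (assumed already projected to dimension D).
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            # Similarity between this pixel and the token embedding.
            logit = sum(f * s for f, s in zip(feat, seg_embedding))
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask


def predict_masks(pixel_features, seg_embeddings):
    """Multi-referring case: one mask per segmentation token."""
    return [predict_mask(pixel_features, e) for e in seg_embeddings]


# Toy 2 x 2 feature map with D = 2 (illustrative values only).
features = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 1.0], [3.0, 1.0]],
]
masks = predict_masks(features, [[1.0, -1.0], [-1.0, 1.0]])
```

Each segmentation token yields its own mask over the same feature map, which is how a single answer can refer to several instances at once.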

Paper Structure (29 sections, 3 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: Overview of GeoPix capabilities in multi-modal, multi-task remote sensing image interpretation through dialogue. GeoPix supports image-level tasks, including image captioning and visual question answering, as well as region-level tasks like visual grounding. Additionally, it expands its functionality to pixel-level tasks, specifically multi-referring segmentation, as depicted in the purple panel.
  • Figure 2: Overview of the proposed GeoPix model architecture and the detailed design of the class-wise learnable memory (CLM) module. The left panel illustrates the overall architecture of GeoPix. GeoPix takes user text instructions and images as input and outputs corresponding answers and segmentation results (if the user’s instruction requests single or multiple object segmentation). The user-input image is encoded to extract multi-scale visual features via the vision encoder. These features are transformed by independent projectors and fed into both the mask predictor and the LLM. The LLM processes the instructions along with the deepest scale of visual features to predict interleaved text and segmentation tokens. Subsequently, the segmentation tokens serve as conditions for the mask predictor, guiding it to predict masks for the instances specified by the user. The right panel showcases the detailed design of the CLM module. The CLM module first encodes the initial mask $m_{\text{init}}^{\ell}$ of each scale $\ell$ into latent representations using the memory encoder, and then retrieves memory features from the memory bank using the category query $Q_{c}$ and scale query $Q_{\ell}$. The memory features are combined with the encoded initial mask via element-wise addition. Finally, the aggregated features serve as the key and value in the memory attention module, enhancing the visual features to obtain memory-enhanced features $H_\text{me}$.
  • Figure 3: The left panel displays the instruction used to prompt both GPT-4o and its fine-tuned version within our description generation pipeline. The right panel provides an example description generated by the fine-tuned GPT-4o. Bounding boxes are shown for instance differentiation and are not part of GPT-4o’s input.
  • Figure 4: Category distribution analysis and word cloud visualization for the SIOR-T, FAST-T, and SOTA-T subsets of GeoPixInstruct. "expwy. ser. area" stands for expressway service area, "expwy. toll stat." stands for expressway toll station, "grd. track fld." stands for ground track field, "stat." is the abbreviation for station, "ct." is the abbreviation for court, and "fld." is the abbreviation for field.
  • Figure 5: The distribution of instance number ($\varphi$) by mask coverage ratio ($\theta$). Subsets located toward the lower-left corner have fewer instances and lower coverage ratios, indicating higher segmentation difficulty.
  • ...and 5 more figures
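
The data flow of the CLM module described in the Figure 2 caption (encode the initial mask, retrieve memory features, add them element-wise, then use the result as key and value in memory attention) can be sketched as follows. This is a simplified illustration under several assumptions that the caption does not confirm: retrieval is modelled as a plain table lookup keyed by (category, scale) rather than learned queries $Q_{c}$ and $Q_{\ell}$; the memory feature is a single vector broadcast over all mask positions; the memory encoder is a per-position linear lift; and the attention is single-head with a residual connection. All function and variable names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def encode_mask(init_mask, weight):
    """Hypothetical memory encoder: lift each scalar mask value to D dims."""
    return [[m * w for w in weight] for m in init_mask]


def memory_attention(visual, keys_values):
    """Single-head dot-product attention with visual tokens as queries."""
    dim = len(visual[0])
    enhanced = []
    for q in visual:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        attn = softmax(scores)
        context = [
            sum(a * kv[d] for a, kv in zip(attn, keys_values))
            for d in range(dim)
        ]
        # Residual connection (an assumption of this sketch).
        enhanced.append([x + c for x, c in zip(q, context)])
    return enhanced


def clm_forward(visual, init_mask, memory_bank, category, scale, enc_weight):
    """Return memory-enhanced features for one scale."""
    encoded = encode_mask(init_mask, enc_weight)   # N x D
    memory = memory_bank[(category, scale)]        # D, lookup is an assumption
    # Element-wise addition of memory features and the encoded initial mask.
    aggregated = [[e + m for e, m in zip(row, memory)] for row in encoded]
    # Aggregated features act as both key and value.
    return memory_attention(visual, aggregated)


# Toy example: 2 visual tokens, D = 2, one (category, scale) entry.
bank = {("ship", 0): [0.5, -0.5]}
visual = [[1.0, 0.0], [0.0, 1.0]]
h_me = clm_forward(visual, [1.0, 1.0], bank, "ship", 0, [0.5, 0.5])
```

In the toy example both aggregated key/value vectors equal `[1.0, 0.0]`, so every visual token receives the same context vector on top of its original features.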