AgMTR: Agent Mining Transformer for Few-shot Segmentation in Remote Sensing

Hanbo Bi, Yingchao Feng, Yongqiang Mao, Jianning Pei, Wenhui Diao, Hongqi Wang, Xian Sun

TL;DR

AgMTR introduces local-aware agents for few-shot segmentation in remote sensing to overcome the pixel-level mismatches caused by extreme intra-class variation and cluttered backgrounds. Three components, the Agent Learning Encoder, the Agent Aggregation Decoder, and the Semantic Alignment Decoder, mine and align class-specific semantics from support, unlabeled, and query data, respectively, and the model uses a cross-attention-based matching strategy. The method achieves state-of-the-art performance on iSAID, and experiments on PASCAL-5$^i$ and COCO-20$^i$ show that it remains competitive on natural scenes. The approach yields robust, context-rich segmentation with competitive efficiency, and extensions to weak-label and cross-domain settings illustrate its practical versatility.

Abstract

Few-shot Segmentation (FSS) aims to segment the objects of interest in a query image using just a handful of labeled samples (i.e., support images). Previous methods leverage the similarity between support-query pixel pairs to construct a pixel-level semantic correlation. However, in remote sensing scenarios with extreme intra-class variations and cluttered backgrounds, such pixel-level correlations may produce substantial mismatches, resulting in semantic ambiguity between the query foreground (FG) and background (BG) pixels. To tackle this problem, we propose a novel Agent Mining Transformer (AgMTR), which adaptively mines a set of local-aware agents to construct an agent-level semantic correlation. Compared with pixel-level semantics, these agents carry local contextual information and possess a broader receptive field. Different query pixels can therefore selectively aggregate the fine-grained local semantics of different agents, which sharpens the semantic distinction between query FG and BG pixels. Concretely, the Agent Learning Encoder (ALE) is first proposed to construct an optimal transport plan that assigns different agents to aggregate support semantics from different local regions. Then, to further optimize the agents, the Agent Aggregation Decoder (AAD) and the Semantic Alignment Decoder (SAD) are constructed to break through the limited support set by mining valuable class-specific semantics from unlabeled data sources and from the query image itself, respectively. Extensive experiments on the remote sensing benchmark iSAID indicate that the proposed method achieves state-of-the-art performance. Surprisingly, our method remains quite competitive when extended to more common natural scenarios, i.e., PASCAL-5$^i$ and COCO-20$^i$.
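
The abstract states that ALE builds an optimal transport plan assigning agents to local regions, but gives no formula here. The sketch below illustrates one common way to obtain such a plan, entropy-regularized Sinkhorn iterations, in plain Python. The cost matrix, the uniform marginals, the regularization strength `eps`, and the helper names `sinkhorn_plan` and `local_masks` are all assumptions made for illustration and are not taken from the paper.

```python
import math


def sinkhorn_plan(cost, row_marginal, col_marginal, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan between agents (rows) and pixels (columns).

    cost[i][j] is an assumed dissimilarity between agent i and support pixel j.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_marginal[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def local_masks(plan):
    """Give each pixel to the agent with the largest transport mass (hypothetical rule)."""
    n, m = len(plan), len(plan[0])
    owner = [max(range(n), key=lambda i: plan[i][j]) for j in range(m)]
    return [[1 if owner[j] == i else 0 for j in range(m)] for i in range(n)]


# Toy example: 2 agents, 4 foreground pixels, hand-written costs.
cost = [[0.0, 0.1, 0.9, 1.0],
        [1.0, 0.9, 0.1, 0.0]]
plan = sinkhorn_plan(cost, [0.5, 0.5], [0.25, 0.25, 0.25, 0.25])
masks = local_masks(plan)
```

The marginal constraints are what keep the regions complementary in this sketch: every pixel distributes a fixed amount of mass and every agent receives a fixed share, so no single agent can absorb the whole foreground.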

Paper Structure (33 sections, 15 equations, 23 figures, 13 tables, 3 algorithms)

Figures (23)

  • Figure 1: (a) Due to the extreme intra-class variations and cluttered backgrounds in remote sensing scenes, directly using the correlations between support-query pixel pairs to aggregate support semantics may produce substantial mismatches. In this case, both FG and BG pixels of the query are likely to aggregate the FG semantics of the support, resulting in semantic ambiguity. (b) AgMTR mines a set of local-aware agents from support, unlabeled, and query images to construct agent-level semantic correlation. The query FG pixel $P_2$ then selectively aggregates the agent semantics responsible for the 'Fuselage', while the BG pixel $P_3$ aggregates the background agent semantics, yielding a clear semantic separation between the query FG and BG pixels.
  • Figure 2: The process of mining agents. First, the Agent Learning Encoder (ALE) uses the support foreground mask to dynamically divide the image into different but complementary local regions, e.g., 'Fuselage', 'Wing', 'Tail', and 'Background', thus assigning different local semantics to different agents. To further optimize the agents, the Agent Aggregation Decoder (AAD) introduces a set of unlabeled images and obtains a set of local prototypes through unsupervised clustering, from which the agents selectively perceive and aggregate fine-grained information. Finally, the Semantic Alignment Decoder (SAD) constructs pseudo-local masks for the query, allowing the agents to aggregate local semantics from the query image.
  • Figure 3: The overall pipeline of the proposed AgMTR, which constructs agent-level semantic correlation to correct the semantic ambiguity between the query FG and BG pixels caused by pixel-level mismatches. Three components mine representative agents without explicit supervision: the Agent Learning Encoder (ALE), the Agent Aggregation Decoder (AAD), and the Semantic Alignment Decoder (SAD). First, ALE dynamically divides the support mask into multiple local masks, guiding different agents to mine the semantics under different local masks so that they become local-aware. AAD then introduces a series of unlabeled images from which the agents selectively perceive and aggregate beneficial semantics, breaking through the limited support set. Finally, SAD mines the semantics of interest from the query image itself to promote semantic consistency between the agents and the query object.
  • Figure 4: The detailed pipeline of the Agent Learning Encoder (ALE), which drives the agents to efficiently capture information of interest from support pixels through a masked cross-attention mechanism. To enhance the semantic diversity of the agents, ALE dynamically decomposes the foreground mask into different yet complementary local masks, ensuring that different agents aggregate different local semantics.
  • Figure 5: The detailed pipeline of the Agent Aggregation Decoder (AAD). AAD introduces unlabeled images containing the class of interest as a reference, so that the agents can adaptively explore richer class-relevant semantics from unlabeled data sources and break through the limitations of the support set.
  • ...and 18 more figures
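
The Figure 4 caption describes agents aggregating support semantics through masked cross-attention, with each agent restricted to its own local mask. A minimal plain-Python sketch of that idea follows. It assumes no learned query/key/value projections, a scaled dot-product score, and a residual update; these choices and the function name `masked_cross_attention` are illustrative assumptions rather than the paper's exact formulation.

```python
import math


def masked_cross_attention(agents, pixels, masks):
    """Update each agent from the support pixels allowed by its binary local mask.

    agents: list of feature vectors, one per agent.
    pixels: list of support feature vectors, one per pixel.
    masks:  masks[i][j] is truthy if agent i may attend to pixel j.
    """
    dim = len(agents[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for agent, mask in zip(agents, masks):
        idx = [j for j, keep in enumerate(mask) if keep]
        if not idx:
            # Empty local mask: leave the agent unchanged.
            updated.append(list(agent))
            continue
        logits = [scale * sum(a * p for a, p in zip(agent, pixels[j])) for j in idx]
        top = max(logits)
        weights = [math.exp(score - top) for score in logits]  # stable softmax
        total = sum(weights)
        context = [sum(w / total * pixels[j][d] for w, j in zip(weights, idx))
                   for d in range(dim)]
        # Residual update (an assumption of this sketch).
        updated.append([a + c for a, c in zip(agent, context)])
    return updated


# Toy example: two agents, four pixels, disjoint local masks.
agents = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
new_agents = masked_cross_attention(agents, pixels, masks)
```

Because the softmax runs only over the pixels inside each mask, disjoint masks force the agents to summarize different regions, which is the diversity property the caption attributes to ALE.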