AFANet: Adaptive Frequency-Aware Network for Weakly-Supervised Few-Shot Semantic Segmentation

Jiaqi Ma, Guo-Sen Xie, Fang Zhao, Zechao Li

TL;DR

AFANet addresses weakly-supervised few-shot semantic segmentation (WFSS) by combining frequency-domain information with online cross-modal guidance. The Cross-Granularity Frequency-Aware Module decouples RGB features into high- and low-frequency components across a pyramid backbone and realigns them to enrich semantic structure, while the CLIP-Guided Spatial-Adapter Module tunes CLIP’s textual priors online to match the downstream task distribution. Together, these components provide stronger semantic guidance under scarce annotations and yield robust pseudo-masks for support and query images. On Pascal-5$^{i}$ and COCO-20$^{i}$, AFANet achieves state-of-the-art results, demonstrating the benefit of integrating frequency-domain cues with online CLIP adaptation for WFSS.

Abstract

Few-shot learning aims to recognize novel concepts by leveraging prior knowledge learned from a few samples. However, for visually intensive tasks such as few-shot semantic segmentation, pixel-level annotations are time-consuming and costly. Therefore, in this paper, we utilize the more challenging image-level annotations and propose an adaptive frequency-aware network (AFANet) for weakly-supervised few-shot semantic segmentation (WFSS). Specifically, we first propose a cross-granularity frequency-aware module (CFM) that decouples RGB images into high-frequency and low-frequency distributions and further optimizes semantic structural information by realigning them. Unlike most existing WFSS methods, which use the textual information from a multi-modal vision-language model (e.g., CLIP) in an offline learning manner, we further propose a CLIP-guided spatial-adapter module (CSM), which performs a spatial-domain adaptive transformation on the textual information through online learning, thus providing enriched cross-modal semantic information for CFM. Extensive experiments on the Pascal-5$^{i}$ and COCO-20$^{i}$ datasets demonstrate that AFANet achieves state-of-the-art performance. The code is available at https://github.com/jarch-ma/AFANet.
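The high-/low-frequency decoupling mentioned in the abstract can be illustrated with a generic Fourier-domain split. The following is a minimal sketch, not the CFM itself: it assumes an ideal circular low-pass mask applied to a single-channel array, and the function name `split_frequencies` and the cutoff `radius` are hypothetical choices made for this illustration.

```python
import numpy as np

def split_frequencies(image, radius):
    """Split a 2-D array into low- and high-frequency parts with an ideal FFT mask.

    Assumption: a hard circular cutoff; the paper's module may decompose differently.
    """
    h, w = image.shape
    # Move the zero-frequency term to the centre so the mask is a disc around it.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius
    # Invert each masked spectrum; the two parts sum back to the input.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high

# Toy image: a smooth horizontal ramp (slow change) plus a fine checkerboard (rapid change).
h = w = 32
ramp = np.tile(np.linspace(0.0, 1.0, w), (h, 1))
checker = 0.2 * ((np.indices((h, w)).sum(axis=0) % 2) * 2 - 1)
image = ramp + checker

low, high = split_frequencies(image, radius=4)
print(np.allclose(low + high, image))  # the split is lossless
```

In this toy case the low-frequency part keeps the mean intensity and the coarse ramp, while the checkerboard pattern ends up in the high-frequency part.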

Paper Structure

This paper contains 18 sections, 12 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: RGB-domain and frequency-domain views of an image. (a) RGB color model. The three primary colors provide information such as the color and texture of an image. (b) Frequency-domain distribution. High-frequency information represents rapid changes in the image and captures fine detail, while low-frequency information represents slow changes in the image and captures global structure.
  • Figure 2: Overview of our AFANet framework. (1) The cross-granularity frequency-aware module (CFM) extracts features from the low (layer 3), middle (layer 9), and high (layer 12) layers of the backbone. The frequency-aware module (FAM) then decomposes the RGB-domain information into high-frequency and low-frequency distributions at different granularities and optimizes the spatial structural information of the frequency domain by realigning them. (2) The CLIP-guided spatial-adapter module (CSM) first reshapes the CLIP text information to $C_{N} \times h \times w$ to fit the network model and then, guided by the CFM output $f_{s,q}$, reduces the distribution gap between the prior knowledge and the network model in a spatially adaptive manner. Finally, the CSM outputs $f_{s}^{\prime\prime}$ and $f_{q}^{\prime\prime}$ are passed into the segmentation network together with the pseudo masks $\tilde{M}_{s}$ and $\tilde{M}_{q}$. Further details can be found in our baseline method, IMR-HSNet.
  • Figure 3: Frequency-aware module (FAM). The green and red arrows denote the updating and exchange processes of low-frequency and high-frequency information, respectively.
  • Figure 4: Qualitative analysis: segmentation results in the 1-shot setting on Pascal-5$^{i}$. From top to bottom, the rows show the support image, the query image (with ground-truth mask), the segmentation results of IMR-HSNet (baseline), and the segmentation results of our model (AFANet). Each column shows a different category.
  • Figure 5: Ablation study of different network layers. Feature maps are extracted from a fixed pre-trained backbone (ResNet-50) at the low layers (0, 1, 2), mid layers (6, 7, 8), high layers (10, 11, 12), and our cross layers (3, 9, 12), respectively.
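The Figure 2 caption states that CSM reshapes the CLIP text information to $C_{N}\times h \times w$ but does not say how. One common way to give a text vector spatial extent is to repeat it at every location; the sketch below assumes that approach. The helper `tile_text_embedding`, the channel size 512, the 13 x 13 grid, and the random stand-in vector are all hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def tile_text_embedding(text_embedding, h, w):
    """Broadcast a length-C vector to a (C, h, w) map by repeating it at every location.

    Assumption: plain tiling; the actual CSM reshaping may differ.
    """
    c = text_embedding.shape[0]
    # broadcast_to returns a read-only view, so copy to get a writable array.
    return np.broadcast_to(text_embedding[:, None, None], (c, h, w)).copy()

rng = np.random.default_rng(0)
text_embedding = rng.standard_normal(512)  # stand-in for a text vector (assumed size)

text_map = tile_text_embedding(text_embedding, 13, 13)
print(text_map.shape)
```

Once the text vector has the same layout as a backbone feature map, the two can be combined channel-wise or location-wise, which is the precondition for any spatially adaptive adjustment.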