Boosting Few-Shot Segmentation via Instance-Aware Data Augmentation and Local Consensus Guided Cross Attention

Li Guo, Haoming Liu, Yuxuan Xia, Chengyu Zhang, Xiaochen Lu

TL;DR

The paper tackles generalization in few-shot semantic segmentation (FSS), where prototype-based methods struggle under domain shifts. It revisits fine-tuning and adds two components: Instance-Aware Data Augmentation (IDA), which diversifies the small support set, and Local Consensus Guided Cross Attention (LCCA), which aligns query and support features via dense cross-image correlations. The approach is implemented within a two-stage training framework and extended to the K-shot setting. Combining IDA with the LC-CAN model yields substantial gains on PASCAL-$5^i$ and COCO-$20^i$, particularly in 5-shot scenarios. The results indicate improved generalization and robustness, suggesting that pairing targeted augmentation with cross-image correspondence can close the gap between fine-tuning and prototype-based methods in FSS.

Abstract

Few-shot segmentation aims to train a segmentation model that can quickly adapt to a novel task for which only a few annotated images are provided. Most recent models have adopted a prototype-based paradigm for few-shot inference. These approaches may have limited generalization capacity beyond the standard 1- or 5-shot settings. In this paper, we closely examine and reevaluate the fine-tuning-based learning scheme that fine-tunes the classification layer of a deep segmentation network pre-trained on diverse base classes. To improve the generalizability of the classification layer optimized with sparsely annotated samples, we introduce an instance-aware data augmentation (IDA) strategy that augments the support images based on the relative sizes of the target objects. The proposed IDA effectively increases the support set's diversity and promotes the distribution consistency between support and query images. On the other hand, the large visual difference between query and support images may hinder knowledge transfer and cripple the segmentation performance. To cope with this challenge, we introduce the local consensus guided cross attention (LCCA) to align the query feature with support features based on their dense correlation, further improving the model's generalizability to the query image. The significant performance improvements on the standard few-shot segmentation benchmarks PASCAL-$5^i$ and COCO-$20^i$ verify the efficacy of our proposed method.
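
The alignment idea mentioned in the abstract, re-expressing each query location in terms of support features according to their dense correlation, can be illustrated with a minimal sketch. This is plain dense cross attention over cosine similarities, not the paper's LCCA: the local consensus step is omitted, and the names `cosine`, `softmax`, `cross_attend` and the `temperature` value are assumptions made for illustration only.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon guards against zero vectors


def softmax(scores: List[float], temperature: float) -> List[float]:
    """Numerically stable softmax over temperature-scaled scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: List[Vector], support: List[Vector],
                 temperature: float = 0.1) -> List[Vector]:
    """For each query location, return a correlation-weighted mix of support features.

    `query` and `support` are flattened feature maps (one vector per location).
    The temperature is a hypothetical value, not taken from the paper.
    """
    dim = len(support[0])
    aligned = []
    for q in query:
        # Dense correlation of this query location with every support location.
        weights = softmax([cosine(q, s) for s in support], temperature)
        aligned.append([sum(w * s[d] for w, s in zip(weights, support))
                        for d in range(dim)])
    return aligned
```

Under these assumptions, a query vector that closely matches one support location is mapped almost entirely onto that location's feature, which is the sense in which the query is "aligned" with the support set before classification.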

Paper Structure (24 sections, 10 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: FSS performance (mIoU) on COCO-$20^i$ as the shot number increases. All models adopt a ResNet-50 backbone. 'Finetune' represents the baseline fine-tuning approach described in Section \ref{ssec:overall}.
  • Figure 2: Visualization of instance-aware data augmentation. Original images are resized (while preserving the aspect ratio) and padded with grey pixels to the input image size. Given the foreground ratio $\mu$, we perform (a) instance-aware cropping when $\mu<\pi_l$ and (b) image downsizing when $\mu>\pi_h$. The solid line in (a) is the bounding box of the largest target object, and the dashed line represents the cropping window.
  • Figure 3: Overview of LC-CAN, which is trained in two stages. In the first stage, the backbone encoder and decoder are pre-trained on base classes. In the second stage, we meta-learn the LCCA module in an episodic manner. At inference time, we train the classifier with the IDA-augmented support set and then pass the LCCA-aligned query feature to the learned classifier for query mask prediction. Note that the LCCA module operates on features from multiple intermediate layers; the diagram illustrates only one layer for simplicity.
  • Figure 4: The local self-attention layer with a spatial extent of $k=3$.
  • Figure 5: Structure of the correlation network, which is composed of a sequence of multi-channel 4D convolution units.
  • ...and 2 more figures
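
The decision rule in the Figure 2 caption (crop when $\mu<\pi_l$, downsize when $\mu>\pi_h$) can be sketched as below. This is an illustrative reading of the caption, not the authors' implementation: the threshold values, the "keep" branch for intermediate ratios, and the function names are assumptions, and the bounding box here covers all foreground pixels rather than only the largest target object.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary support mask, 1 = foreground pixel


def foreground_ratio(mask: Mask) -> float:
    """Fraction of pixels labelled foreground (mu in the Figure 2 caption)."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask must contain at least one pixel")
    return sum(sum(row) for row in mask) / total


def bounding_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Inclusive (top, left, bottom, right) box around all foreground pixels.

    Simplification: the caption refers to the largest target object only.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        raise ValueError("mask has no foreground pixels")
    return rows[0], min(cols), rows[-1], max(cols)


def choose_augmentation(mask: Mask, pi_l: float = 0.1, pi_h: float = 0.5) -> str:
    """Pick an augmentation from the foreground ratio.

    pi_l and pi_h are hypothetical thresholds; the values used in the paper
    are not given in this section.
    """
    mu = foreground_ratio(mask)
    if mu < pi_l:
        return "crop"      # small object: crop a window around it
    if mu > pi_h:
        return "downsize"  # large object: shrink the image
    return "keep"          # assumption: leave intermediate cases unchanged
```

For example, a 10x10 mask with a single foreground pixel has $\mu = 0.01$ and would be routed to cropping under these assumed thresholds, while a mask that is 60% foreground would be downsized.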