Adapt Before Comparison: A New Perspective on Cross-Domain Few-Shot Segmentation

Jonas Herzog

TL;DR

This work tackles cross-domain few-shot segmentation by abandoning training on a source domain and instead performing test-time task adaptation. By attaching tiny per-layer adapters to a frozen ImageNet-pretrained backbone and enforcing consistency through dense contrastive losses, the method specializes features to the target task before performing dense query-support comparison. The approach achieves state-of-the-art results on CD-FSS benchmarks, demonstrating that test-time adaptation can outperform traditional training-based generalization strategies. The findings argue for rethinking CD-FSS from training-time generalization to robust, task-specific adaptation at inference, with implications for efficiency and practical deployment.

Abstract

Few-shot segmentation performance declines substantially when facing images from a domain different from the training domain, effectively limiting real-world use cases. To alleviate this, cross-domain few-shot segmentation (CD-FSS) has recently emerged. Works that address this task have mainly attempted to learn segmentation on a source domain in a manner that generalizes across domains. Surprisingly, we can outperform these approaches while eliminating the training stage and removing their main segmentation network. We show that test-time task adaptation is instead the key to successful CD-FSS. Task adaptation is achieved by appending small networks to the feature pyramid of a conventionally classification-pretrained backbone. To avoid overfitting to the few labeled samples in supervised fine-tuning, consistency across augmented views of input images serves as guidance while learning the parameters of the attached layers. Despite our self-restriction not to use any images other than the few labeled samples at test time, we achieve new state-of-the-art performance in CD-FSS, evidencing the need to rethink approaches for the task.
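The "adapt before comparison" idea can be illustrated with a toy sketch. It is not the authors' implementation: the function names (`adapt`, `dense_compare`), the per-pixel linear map standing in for the small attached network, the max-over-foreground matching rule, the 0.5 threshold, and all feature values are assumptions chosen for illustration. The adapter weight is hand-picked here, whereas the paper learns it at test time from consistency across augmented views.

```python
import math

def adapt(features, weight):
    """Apply a per-pixel linear map to every feature vector (hypothetical adapter)."""
    return [[sum(w * x for w, x in zip(row, f)) for row in weight] for f in features]

def cosine(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def dense_compare(query, support, support_mask):
    """Score each query pixel by its best match among support foreground pixels."""
    foreground = [f for f, m in zip(support, support_mask) if m == 1]
    return [max(cosine(q, s) for s in foreground) for q in query]

# Toy "frozen backbone" features: one 3-dim vector per pixel (flattened grid).
# The third channel is shared by all pixels and carries no class information.
support_feats = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.1, 1.0]]
support_mask = [1, 0, 1]  # 1 = foreground, 0 = background
query_feats = [[0.9, 0.0, 1.0], [0.0, 1.0, 0.9]]

# Hand-picked adapter (3 -> 2 dims) that drops the uninformative channel.
# Assumption: a learned adapter would find a comparable projection.
adapter_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

raw_scores = dense_compare(query_feats, support_feats, support_mask)
adapted_scores = dense_compare(
    adapt(query_feats, adapter_w), adapt(support_feats, adapter_w), support_mask
)

threshold = 0.5  # illustrative value, not taken from the paper
raw_prediction = [1 if s > threshold else 0 for s in raw_scores]
prediction = [1 if s > threshold else 0 for s in adapted_scores]
print(raw_prediction, prediction)
```

In this toy setup the second query pixel is a false positive when raw features are compared, because the shared channel inflates its similarity to the support foreground. After the projection its score falls well below the threshold, which is the effect that task adaptation is meant to achieve.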

Paper Structure

This paper contains 29 sections, 28 equations, 13 figures, 11 tables, 3 algorithms.

Figures (13)

  • Figure 1: Top: Few-shot segmentation across domains has been addressed by training a deep network on segmentation tasks from a source domain. We demonstrate that the efforts to achieve generalizability during this stage are largely unsuccessful. Bottom: In the proposed approach, we entirely forgo such training. Instead, backbone-attached layers (green) adapt features to the target task at test time.
  • Figure 1: FB-IoU is important to report alongside mIoU: one could naively outperform the previous SOTA on mIoU simply by assigning foreground to all query pixels (100%). 1-shot Deepglobe results; the true foreground ratio is 43.5%. †: obtained with models we trained ourselves.
  • Figure 2: Overview of proposed method: Query (red) and support (blue) images are augmented to generate views of them. Original image and views are fed separately through a frozen backbone as well as our attached task-specific heads to generate a lower-dimensional feature pyramid. The task-specific networks are trained to maximize intra-level consistency across views. Adapted features are then densely compared in the cross-correlation module. Finally, the level-wise prediction maps are aggregated, thresholded and refined to generate a binary query foreground class prediction.
  • Figure 3: Issue of Deepglobe ground truth annotation. Image row showing an episode featuring the Agricultural Land class, overlaid in pink. The green-encircled area contains an inaccurate inclusion of Forest areas in the ground truth (Query) annotation. Notably, our model appears to segment agricultural land more precisely than the ground truth.
  • Figure 4: Contrary to common belief, fine-tuning does not lead to overfitting to the support set with our approach. By learning consistent embedding spaces, we enhance class discriminability not only for the support (solid lines) but also for the test query (dashed). As a result, irrelevant regions are no longer activated in the coarse query prediction with TA.
  • ...and 8 more figures
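
The caption on FB-IoU versus mIoU can be made concrete with a small worked example. This is a sketch under stated assumptions: FB-IoU is taken to be the mean of foreground IoU and background IoU (the usual definition in few-shot segmentation, not spelled out in the captions above), and a single hypothetical 1000-pixel image with exactly 43.5% foreground stands in for the dataset-level ratio. The helper `iou` is introduced here for illustration only.

```python
def iou(pred, target, cls):
    """Intersection over union for one class label over flat pixel lists."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return inter / union if union else 0.0

# Hypothetical image: 1000 pixels, 435 of them foreground (43.5%).
target = [1] * 435 + [0] * 565

# Naive predictor: label every pixel as foreground.
pred = [1] * 1000

fg_iou = iou(pred, target, 1)      # 435 / 1000
bg_iou = iou(pred, target, 0)      # 0 / 565
fb_iou = (fg_iou + bg_iou) / 2     # assumed FB-IoU definition

print(fg_iou, bg_iou, fb_iou)
```

The all-foreground predictor reaches a foreground IoU equal to the foreground ratio, while its background IoU is zero, so the FB-IoU is halved. Reporting both metrics therefore exposes a degenerate solution that a foreground-only score would hide.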