FCC: Fully Connected Correlation for One-Shot Segmentation

Seonghyeon Moon, Haein Kong, Muhammad Haris Khan, Mubbasir Kapadia, Yuewei Lin

TL;DR

The paper tackles one-shot segmentation (OSS) by enriching prior information through Fully Connected Correlation (FCC), which integrates cross-layer correlations across all ViT encoder layers in addition to traditional same-layer comparisons. By employing a Dual-Condition FCC (DCFC) and a lightweight 4D-convolution decoder, the method captures target-specific patterns across scale, occlusion, and shape variations, leading to state-of-the-art results on PASCAL-5i and COCO-20i and strong generalization under domain shift. Key contributions include the introduction of FCC, the DCFC architecture, ablation analyses validating the benefits of cross-layer correlation and the dual path, and demonstrated convergence speed advantages. The approach enables robust OSS performance without relying on vision-language or prompt-based models, and it shows potential for broader domain-specific segmentation tasks that demand rich, multi-layer priors.

Abstract

Few-shot segmentation (FSS) aims to segment the target object in a query image using only a small set of support images and masks. Strong prior information about the target object, derived from the support set, is therefore essential for guiding the initial training of FSS, and it underpins success in challenging cases, such as when the target object varies considerably in appearance, texture, or scale between the support and query images. Previous methods obtain this prior by building correlation maps from pixel-level correlations on final-layer or same-layer features. However, we found that these approaches offer only limited, partial information when advanced models such as Vision Transformers are used as the backbone. Vision Transformer encoders have a multi-layer structure whose intermediate layers share identical shapes, so features from any pair of layers can be compared directly, and leveraging feature comparisons across all encoder layers can enhance few-shot segmentation performance. We introduce FCC (Fully Connected Correlation), which integrates pixel-level correlations between support and query features, capturing associations that reveal target-specific patterns and correspondences across both same-layer and cross-layer pairs. FCC captures previously inaccessible target information, effectively addressing the limitations of the support mask. Our approach consistently demonstrates state-of-the-art performance on PASCAL, COCO, and domain-shift tests. We conducted an ablation study and a cross-layer correlation analysis to validate FCC's core methodology. These findings confirm the effectiveness of FCC in enhancing prior information and overall model performance.
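
The core idea of the abstract, correlating every query layer with every support layer rather than only matching layers, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of cosine similarity, the clamping of negative values to zero, and the random stand-in features are all assumptions introduced here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def fully_connected_correlation(query_feats, support_feats):
    """Correlate every query layer with every support layer.

    query_feats, support_feats: arrays of shape (L, N, C), i.e. L encoder
    layers, N patch tokens, C channels. The layers must share one shape,
    as the intermediate layers of a ViT encoder do.
    Returns an array of shape (L, L, N, N); entry [i, j, n, m] is the
    cosine similarity between query token n at layer i and support token
    m at layer j. Negative values are clamped to zero (an assumption).
    """
    q = l2_normalize(query_feats)
    s = l2_normalize(support_feats)
    corr = np.einsum("inc,jmc->ijnm", q, s)
    return np.maximum(corr, 0.0)

def same_layer_correlation(query_feats, support_feats):
    """Same-layer baseline: keep only the i == j pairs, shape (L, N, N)."""
    full = fully_connected_correlation(query_feats, support_feats)
    idx = np.arange(full.shape[0])
    return full[idx, idx]

# Random stand-ins for backbone features (hypothetical sizes).
rng = np.random.default_rng(0)
L, N, C = 12, 16, 32  # 12 layers, a 4x4 token grid, 32 channels
query = rng.standard_normal((L, N, C))
support = rng.standard_normal((L, N, C))

fcc = fully_connected_correlation(query, support)
print(fcc.shape)  # (12, 12, 16, 16)
```

With L layers, the same-layer baseline yields L correlation maps while the fully connected variant yields L x L. Reshaping each N x N map to (H, W, H, W) for an H x W token grid gives the 4D form that a 4D-convolution decoder would consume.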

Paper Structure

This paper contains 13 sections, 4 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: The proposed method, FCC, demonstrates precise segmentation in challenging scenarios on COCO-20$^i$ [lin2015microsoft] compared to the baseline: (a) Scale Difference: where the target object appears at different scales in the query and support images; (b) Scale Difference + Occlusion: where the target object is partially occluded by other objects, with only part of it visible in the support set; (c) Scale Difference + Occlusion + Shape Difference: where objects of the same class differ in color and shape, as seen in various dog breeds; (d) Scale Difference + Occlusion + Shape Difference + Limited Information: the most challenging case, where minimal information is provided, and all previous challenges are present simultaneously. The green circle highlights the area of significant improvement with FCC.
  • Figure 2: (a) Previous methods in few-shot segmentation typically compare features at the same-level layer to generate the correlation map. (b) FCC, however, leverages all layers and their correlations to capture maximum target information. Note: For clarity, only one query feature block is visualized.
  • Figure 3: (a) The Fully Connected Correlation (FCC) integrates cross-layer and same-layer correlations to capture comprehensive information. (b) The support image and the masked support image are processed through the backbone to extract support and target features. Each set is correlated with the query features, extracted from the query image, to generate the fully connected correlation maps $FCC_s$ and $FCC_t$. The two maps are then concatenated and passed through a $1 \times 1$ convolution layer to reduce the channel size. The resulting output is fed into the decoder to predict the target segmentation.
  • Figure 4: Centered Kernel Alignment (CKA) [pmlr-v97-kornblith19a, cka2] similarity heatmaps between features from DINOv2-B layers illustrate the effect of target object differences. The background of the images is removed to focus solely on the target object. (a) CKA of an image with itself: similarity concentrates along the diagonal across the entire heatmap because identical features are compared. (b) CKA between the largest and the medium-sized cat: the diagonal region from the center to the bottom right is highlighted. (c) CKA between the largest and the smallest cat: similarity in the high layers concentrates along the diagonal, while in the low-to-mid layers it spreads out. (d) A simulated occlusion scenario in which the cat is obstructed by a large square. (e) The most challenging conditions replicated together, including scale variation, occlusion, shape differences, and limited information. When extracting features, low-to-middle layers capture different levels of detail depending on variations in the object, while higher layers tend to capture more abstract, semantic representations. This results in high similarity values appearing outside the diagonal entries.
  • Figure 5: The most challenging cases, where the support mask occupies less than 5% of the image in PASCAL-5$^i$ [pascal] and COCO-20$^i$ [lin2015microsoft]. Ours (FCC) with DINOv2 [oquab2023dinov2] segments these cases noticeably better than the baseline in the one-shot setting.
  • ...and 1 more figure
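
The layer-by-layer heatmaps described for Figure 4 can be approximated with linear CKA, sketched below. This is a hedged illustration only: the captions do not say which CKA variant or which token sampling the authors use, so the linear form, the helper names `linear_cka` and `cka_heatmap`, and the random stand-in features are assumptions introduced here.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two feature matrices.

    x: (n, d1), y: (n, d2), with rows describing the same n samples
    (for example, n patch tokens). Returns a value in [0, 1].
    """
    # Center each feature dimension over the n samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

def cka_heatmap(feats_a, feats_b):
    """Layer-by-layer CKA matrix.

    feats_a: (La, n, C) and feats_b: (Lb, n, C), the per-layer features
    of two inputs. Entry [i, j] compares layer i of the first input with
    layer j of the second.
    """
    heat = np.zeros((len(feats_a), len(feats_b)))
    for i, fa in enumerate(feats_a):
        for j, fb in enumerate(feats_b):
            heat[i, j] = linear_cka(fa, fb)
    return heat

# Random stand-ins for per-layer features (hypothetical sizes).
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 64, 16))  # 4 layers, 64 tokens, 16 channels

# Comparing an input with itself, as in panel (a): the diagonal is exactly 1.
heat = cka_heatmap(feats, feats)
print(np.round(np.diag(heat), 6))
```

In a heatmap like this, mass concentrated on the diagonal means matching layers carry the most similar representations. High values off the diagonal, as the caption reports for the harder cases, are the motivation for also correlating non-matching layers.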