UINO-FSS: Unifying Representation Learning and Few-shot Segmentation via Hierarchical Distillation and Mamba-HyperCorrelation

Wei Zhuo, Zhiyue Tang, Wufeng Xue, Hao Ding, Junkai Ji, Linlin Shen

TL;DR

UINO-FSS is a unified, single-encoder approach to few-shot semantic segmentation (FSS). It distills SAM's knowledge into a lightweight segmenter that sits on top of a frozen DINOv2 encoder; the segmenter comprises a Bottleneck Adapter, a Meta-Visual Prompt Generator, and a SAM-initialized mask decoder. Coarse-to-fine hierarchical distillation aligns the cross-model embeddings, while a Mamba-based 4D hypercorrelation module and contrastive enhancement generate robust dense prompts for precise mask decoding. The method achieves state-of-the-art 1-shot results on PASCAL-5$^i$ (80.6 mIoU) and COCO-20$^i$ (64.5 mIoU) and transfers well out of distribution to FSS-1000, all with a compact parameter footprint (as low as 0.07M trainable parameters in compact variants). The framework offers a practical, resource-efficient way to leverage large foundation models in FSS and points toward single-encoder designs that combine the strengths of SAM and DINOv2.

Abstract

Few-shot semantic segmentation has attracted growing interest for its ability to generalize to novel object categories using only a few annotated samples. To address data scarcity, recent methods incorporate multiple foundation models to improve feature transferability and segmentation performance. However, they often rely on dual-branch architectures that combine pre-trained encoders to leverage complementary strengths, a design that limits flexibility and efficiency. This raises a fundamental question: can we build a unified model that integrates knowledge from different foundation architectures? Achieving this is, however, challenging due to the misalignment between class-agnostic segmentation capabilities and fine-grained discriminative representations. To this end, we present UINO-FSS, a novel framework built on the key observation that early-stage DINOv2 features exhibit distribution consistency with SAM's output embeddings. This consistency enables the integration of both models' knowledge into a single-encoder architecture via coarse-to-fine multimodal distillation. In particular, our segmenter consists of three core components: a bottleneck adapter for embedding alignment, a meta-visual prompt generator that leverages dense similarity volumes and semantic embeddings, and a mask decoder. Using hierarchical cross-model distillation, we effectively transfer SAM's knowledge into the segmenter, further enhanced by Mamba-based 4D correlation mining on support-query pairs. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ show that UINO-FSS achieves new state-of-the-art results under the 1-shot setting, with mIoU of 80.6 (+3.8%) on PASCAL-5$^i$ and 64.5 (+4.1%) on COCO-20$^i$, demonstrating the effectiveness of our unified approach.
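
To make the "4D correlation" on support-query pairs concrete, the following is a minimal sketch of how a dense 4D similarity volume is commonly built: every query position is compared with every support position by cosine similarity. This is an illustration only, not the authors' implementation; the function names, the nested-list feature layout, and the clamping of negative similarities to zero are assumptions, and the Mamba-based processing of the volume is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def correlation_4d(query, support):
    """Build a 4D correlation volume from two feature grids.

    query:   Hq x Wq x C nested lists (hypothetical layout)
    support: Hs x Ws x C nested lists (hypothetical layout)
    Returns corr with corr[i][j][k][l] = max(0, cos(query[i][j], support[k][l])).
    Clamping at zero is an assumption borrowed from common hypercorrelation practice.
    """
    q = [[l2_normalize(v) for v in row] for row in query]
    s = [[l2_normalize(v) for v in row] for row in support]
    return [
        [
            [
                [max(0.0, sum(a * b for a, b in zip(qv, sv))) for sv in s_row]
                for s_row in s
            ]
            for qv in q_row
        ]
        for q_row in q
    ]


# Toy example: a 1x2 query grid and a 2x1 support grid with 2-dimensional features.
toy_query = [[[1.0, 0.0], [0.0, 2.0]]]
toy_support = [[[3.0, 0.0]], [[-1.0, 0.0]]]
toy_corr = correlation_4d(toy_query, toy_support)
```

The volume has shape Hq x Wq x Hs x Ws, so its size grows with the product of the two grids' spatial sizes; that cost is the usual motivation for an efficient module to process it.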

Paper Structure

This paper contains 33 sections, 14 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Overview of our framework. Our framework (b) is unified and efficient compared to the existing dual-modal architectures [liu2024matcher, zhang2025bridge, sun2024vrp] shown in (a). Our lightweight segmenter (c) contains only 5.6M parameters, but gains knowledge from the powerful SAM via the procedure in (d).
  • Figure 2: Analysis of the embeddings from DINOv2 and SAM. We compute the self-similarity map (SSM) of the feature vectors at the red dots to analyze the embedding distribution of both models, and visualize the maps in the figure. Here, (a) shows the SSM of features from the last layer of SAM's encoder, while (b) displays the SSM of features from all layers of DINOv2-Base. Unlike SAM's holistic semantic focus, DINOv2's high-level embeddings concentrate on various local regions, offering a richer representation. Despite this, we found that embeddings from DINOv2's 3rd layer are the most similar to SAM's encoder embeddings. This observation enables efficient cross-model distillation for our lightweight segmenter.
  • Figure 3: The proposed UINO-FSS architecture. Our architecture consists of a DINOv2 encoder and a lightweight segmenter that includes a bottleneck adapter (BA), a Meta-Visual Prompt Generator (MVPG), and a mask decoder. The upper part shows the coarse-to-fine cross-model distillation procedure, in which only the adapter is trainable for feature matching. The lower part shows the overall architecture of our few-shot semantic segmentation model. Our MVPG includes two modules, one for the Semantic-aware Visual Prompts (SVP) and one for the Mamba-based dense prompts.
  • Figure 4: Network structure for the Mamba-HyperCorrelation Module (MHCM). MHCM stacks two Hierarchical Global Modeling Blocks (HGMB), which process 4D volumetric correlation while maintaining high efficiency.
  • Figure 5: Qualitative results of UINO-FSS on COCO-20$^i$ under the one-shot setting. Shown in order are the support image with its corresponding mask, the query image with ground-truth annotation, the output of UINO-FSS without the CE and MHCM modules, the output of UINO-FSS without CE, and the output of the complete UINO-FSS model. Red circles indicate inaccurate segmentation.
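
The self-similarity map (SSM) described in the Figure 2 caption can be sketched as follows: pick the feature vector at an anchor location (a "red dot") and compare it with the feature at every other location. This is a minimal illustration that assumes cosine similarity as the measure and a nested-list H x W x C feature grid; the function names and the toy values are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def self_similarity_map(features, anchor):
    """Compare the feature at `anchor` (row, col) with every position in the grid.

    features: H x W x C nested lists (hypothetical layout)
    Returns an H x W grid of cosine similarities to the anchor feature.
    """
    row, col = anchor
    reference = features[row][col]
    return [[cosine(reference, vec) for vec in grid_row] for grid_row in features]


# Toy 2x2 feature grid with 2-dimensional features; the anchor is the top-left cell.
toy_features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [1.0, 1.0]],
]
toy_ssm = self_similarity_map(toy_features, (0, 0))
```

Computing such maps for the same anchor on features from different layers or models gives one simple way to compare how their embeddings are distributed, which is the kind of comparison the caption describes.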