Annotation Free Semantic Segmentation with Vision Foundation Models

Soroush Seifi, Daniel Olmeda Reino, Fabien Despinoy, Rahaf Aljundi

TL;DR

This work generates free annotations for any semantic segmentation dataset using existing foundation models, and builds a lightweight module on top of a self-supervised vision encoder, DinoV2, to align the patch features with a pretrained text encoder for zeroshot semantic segmentation.

Abstract

Semantic segmentation is one of the most challenging vision tasks, usually requiring large amounts of training data with expensive pixel-level annotations. With the success of foundation models, and especially vision-language models, recent works attempt to achieve zeroshot semantic segmentation, but they require either large-scale training or additional image- or pixel-level annotations. In this work, we generate free annotations for any semantic segmentation dataset using existing foundation models. We use CLIP to detect objects and SAM to generate high-quality object masks. Next, we build a lightweight module on top of a self-supervised vision encoder, DinoV2, to align the patch features with a pretrained text encoder for zeroshot semantic segmentation. Our approach can bring language-based semantics to any pretrained vision encoder with minimal training, uses foundation models as the sole source of supervision, and generalizes from little training data with no annotation.
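
The zeroshot step described in the abstract can be illustrated with a minimal sketch: once patch features live in the text embedding space, each patch takes the label of the most similar text embedding. The function names, the use of cosine similarity, and the toy 3-dimensional vectors below are assumptions for illustration only; they are not taken from the paper's implementation, and real features would come from the aligned DinoV2 encoder and the pretrained text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def classify_patches(patch_features, text_prototypes):
    """Assign each patch the class whose text embedding is most cosine-similar.

    patch_features: list of D-dim vectors, assumed already projected into text space.
    text_prototypes: dict mapping class name -> D-dim text embedding (hypothetical values).
    """
    names = list(text_prototypes)
    protos = [l2_normalize(text_prototypes[name]) for name in names]
    labels = []
    for feat in patch_features:
        unit_feat = l2_normalize(feat)
        # Dot product of unit vectors equals cosine similarity.
        sims = [sum(a * b for a, b in zip(unit_feat, proto)) for proto in protos]
        labels.append(names[sims.index(max(sims))])
    return labels


# Toy embeddings standing in for real encoder outputs (assumption).
prototypes = {"cat": [1.0, 0.0, 0.0], "grass": [0.0, 1.0, 0.0]}
patches = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [2.0, 0.5, 0.0]]
predicted = classify_patches(patches, prototypes)
```

In this sketch the third patch has a larger magnitude than the first, but normalization makes the decision depend only on direction, so both are assigned to the same class.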

Paper Structure (22 sections, 4 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Overview of our method (FMbSeg). a) Label Generation: We detect object and background categories in an image using a frozen pretrained image-text model (e.g. CLIP, in pink). We select image patches with high similarity to the text representation of the detected categories. We pass the locations of those patches to a mask proposal network (e.g. SAM, in green). b) Visual Features Alignment: We use the generated segmentations and detected categories to align features from a more expressive frozen image encoder (e.g. DINOv2, in blue) with a frozen pretrained text encoder. c) Test-Time Inference: At test time, the newly aligned image encoder projects image features into text space. Every pixel is classified according to its similarity to the pre-computed text prototypes of a target ontology.
  • Figure 2: Patch level alignment between image and class. First row shows images from Pascal VOC. Second row shows the similarity between patch features from CLIP and the text features of the detected category. Third row shows the similarity map after aligning a DINOv2 model with FMbSeg.
  • Figure 3: Qualitative evaluation of Stage 1.1. SAM query points generated by our method are shown as green stars. The left side shows correct segmentations from Stage 1.1; the right side shows its limitations: small objects, wrongly detected classes (due to ambiguities), and too few query points to cover all instances. Stage 1.2 alleviates the issues with small objects and incomplete masks, since it labels all the masks generated accurately by SAM.
  • Figure 4: Qualitative results of zeroshot segmentation. The first row shows the ground truth labels. The second row shows the results of FMbSeg-Stage 2 (refined).
  • Figure 5: Water segmentation results: Stage 1 accurately segments bodies of water in the presence of anomalies and different lighting conditions, achieving an mIoU of $83.1\%$.
  • ...and 1 more figure
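
The label-generation step in the Figure 1 caption (select patches with high similarity to a detected category, then pass their locations to a mask proposal network) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the fixed `threshold`, the `max_points` cap, and the choice of patch centers as query points are all hypothetical, since the captions do not specify the actual selection rule. The resulting pixel coordinates are the kind of point prompts a model such as SAM accepts; no SAM call is made here.

```python
def select_query_points(similarity_map, patch_size, threshold, max_points):
    """Return (x, y) pixel centers of the patches most similar to a class prompt.

    similarity_map: 2D list (rows x cols) of patch-to-text similarity scores.
    patch_size: side length of one square patch in pixels.
    threshold, max_points: hypothetical selection parameters (assumptions).
    """
    candidates = [
        (score, row, col)
        for row, scores in enumerate(similarity_map)
        for col, score in enumerate(scores)
        if score >= threshold
    ]
    # Highest-scoring patches first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    points = []
    for _, row, col in candidates[:max_points]:
        # Convert the patch grid index to the pixel at the patch center.
        x = col * patch_size + patch_size // 2
        y = row * patch_size + patch_size // 2
        points.append((x, y))
    return points


# Toy 3x3 similarity map for one detected category (made-up values).
sim = [
    [0.10, 0.15, 0.12],
    [0.11, 0.42, 0.35],
    [0.09, 0.30, 0.13],
]
points = select_query_points(sim, patch_size=16, threshold=0.25, max_points=2)
```

Capping the number of points keeps the prompt small, but as the Figure 3 caption notes, too few query points can leave some object instances uncovered.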