Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling

Leon Sick, Dominik Engel, Pedro Hermosilla, Timo Ropinski

TL;DR

This work tackles unsupervised semantic segmentation by introducing DepthG, a method that injects 3D scene structure into contrastive learning. It combines a Depth-Feature Correlation loss with a Depth-Guided Farthest Point Sampling scheme to align 3D distances with feature similarities and to sample features in geometry-aware ways. Depths are obtained with zero-shot monocular depth estimators during training, and the method remains depth-free at test time. Across COCO-Stuff, Cityscapes, and Potsdam, DepthG achieves state-of-the-art unsupervised results, highlighting the value of incorporating depth priors into self-supervised segmentation.

Abstract

Traditionally, training neural networks to perform semantic segmentation required expensive human-made annotations. More recently, advances in unsupervised learning have made significant progress on this issue and toward closing the gap to supervised algorithms. To achieve this, semantic knowledge is distilled by learning to correlate randomly sampled features from images across an entire dataset. In this work, we build upon these advances by incorporating information about the structure of the scene into the training process through the use of depth information. We achieve this by (1) learning depth-feature correlation, spatially correlating the feature maps with the depth maps to induce knowledge about the structure of the scene, and (2) implementing farthest-point sampling, which applies 3D sampling techniques to the depth information of the scene to select relevant features more effectively. Finally, we demonstrate the effectiveness of our technical contributions through extensive experimentation and present significant improvements in performance across multiple benchmark datasets.
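
The farthest-point sampling idea in contribution (2) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the use of normalized pixel coordinates as x and y (instead of real camera intrinsics), and the fixed starting index are all assumptions made for illustration.

```python
import numpy as np

def depth_to_points(depth):
    """Lift an (H, W) depth map to (H*W, 3) points.

    Assumption: x and y are normalized pixel coordinates in [0, 1];
    no camera intrinsics are used, so the geometry is only approximate.
    """
    h, w = depth.shape
    ys, xs = np.meshgrid(np.linspace(0.0, 1.0, h),
                         np.linspace(0.0, 1.0, w), indexing="ij")
    return np.stack([xs.ravel(), ys.ravel(), depth.ravel()], axis=1)

def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedy FPS: repeatedly pick the point farthest from all picked points."""
    selected = np.empty(num_samples, dtype=np.int64)
    selected[0] = start_index
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(points - points[start_index], axis=1)
    for i in range(1, num_samples):
        nxt = int(np.argmax(min_dist))
        selected[i] = nxt
        dist = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return selected

def sample_feature_locations(depth, num_samples):
    """Return (num_samples, 2) row/column locations chosen by FPS in 3D."""
    idx = farthest_point_sampling(depth_to_points(depth), num_samples)
    rows, cols = np.unravel_index(idx, depth.shape)
    return np.stack([rows, cols], axis=1)
```

The returned row/column pairs could then be used to index a feature map of the same spatial resolution. Because each new sample maximizes its distance to the samples already chosen, depth discontinuities tend to attract samples, which random sampling does not guarantee.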

Paper Structure

This paper contains 39 sections, 7 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Guiding the feature space for unsupervised segmentation with depth information. The intuition behind the proposed approach is simple: for locations that are close together in 3D space, we guide the model to map their features closer together. Conversely, the model learns to push features apart in feature space if their distance in metric space is large.
  • Figure 2: Overview of the DepthG training process. After 5-cropping the image, each crop is encoded by the DINO-pretrained ViT $\mathcal{F}$ to output a feature map. Using farthest point sampling (FPS), we sample the 3D space evenly and convert the coordinates to select samples in the feature map. The sampled features are further transformed by the segmentation head $\mathcal{S}$. For both feature maps, the correlation tensor is computed. Next, we sample the depth map at the coordinates obtained by FPS and compute a correlation tensor in the same fashion. Finally, we compute our Depth-Feature Correlation loss and combine it with the feature distillation loss from STEGO. We guide the model to learn depth-feature correlation for crops of the same image, while the feature distillation loss is also applied to k-NN-selected and random images.
  • Figure 3: Local Hidden Positives. We visualize the use of depth and attention maps for local hidden positives. For this visualization, we sample the respective propagation maps at the yellow patch in the center of the crops. We observe that the depth map has sharper borders and more consistent propagation values. We experiment with both propagation strategies in Section \ref{sec:lhp-abl}.
  • Figure 4: Qualitative results. We show qualitative differences for plain STEGO compared to STEGO with our depth guidance, using ViT-S models for COCO and ViT-B for Cityscapes. Where STEGO struggles to differentiate instances, our model is able to correct this and successfully separates them for segmentation. In the case of the building in (a), our method alleviates visual irritations from the pixel space and corrects the segmentation of the building. In (b), our model is able to better handle visual inconsistencies from shadows.
  • Figure 5: Random vs. Farthest Point Sampling. We observe that random sampling can miss entire structures, such as the trees in the top row and the plane in the bottom row. In contrast, our method meaningfully samples the depth space and selects locations across the different structures and at depth edges. We show further illustrations of FPS in the appendix.
  • ...and 7 more figures
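
The interplay of the two correlation tensors described in the Figure 2 caption can be sketched in NumPy as follows. The paper's exact loss is given in its equations, which are not reproduced here, so everything below is an assumption: the shifted-correlation form is loosely modeled on STEGO's distillation loss, and `depth_correlation`, the `shift` parameter, and the distance-to-similarity mapping are hypothetical choices for illustration.

```python
import numpy as np

def cosine_correlation(a, b):
    """Pairwise cosine similarity between rows of a (N, C) and b (M, C)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def depth_correlation(points_a, points_b):
    """Hypothetical depth correlation: 1 for coincident 3D points,
    falling linearly to 0 for the most distant pair."""
    d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    return 1.0 - d / (d.max() + 1e-8)

def depth_feature_correlation_loss(feats_a, feats_b, points_a, points_b,
                                   shift=0.5):
    """Assumed loss form: reward feature similarity where 3D points are
    close (depth correlation above `shift`) and penalize it where far."""
    f = np.clip(cosine_correlation(feats_a, feats_b), 0.0, None)
    d = depth_correlation(points_a, points_b)
    return float(-np.mean((d - shift) * f))
```

Under these assumptions, the loss decreases when features of nearby 3D points are similar and features of distant points are dissimilar, which matches the intuition stated in the Figure 1 caption. At test time no depth input is needed, since depth only enters through this training loss and the sampling step.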