Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation

Jonas Ernst, Wolfgang Boettcher, Lukas Hoyer, Jan Eric Lenssen, Bernt Schiele

TL;DR

Rewis3d is a framework that uses recent advances in feed-forward 3D reconstruction to improve weakly supervised semantic segmentation on 2D images. It enforces semantic consistency between 2D images and reconstructed 3D point clouds, with the feed-forward reconstruction supplying geometric supervision, and outperforms existing sparse-supervision approaches by 2-7% without additional labels or inference overhead.

Abstract

We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance in sparse supervision, outperforming existing approaches by 2-7% without requiring additional labels or inference overhead.
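
As a rough illustration of the cross-modal consistency idea, the sketch below computes a confidence-weighted cross-entropy in which the teacher of one modality supervises the student of the other. It is a minimal sketch and not the authors' implementation: the function name, the use of the teacher's argmax as a pseudo-label, the product of prediction confidence and reconstruction confidence as the weight, and the normalization by the weight sum are all assumptions made for illustration.

```python
import math


def cross_modal_consistency_loss(student_probs, teacher_probs, recon_conf, eps=1e-8):
    """Confidence-weighted cross-entropy between a student and a cross-modal teacher.

    student_probs, teacher_probs: lists of per-class probability lists. Element i
    of both lists refers to the same pixel / 3D-point correspondence.
    recon_conf: per-element reconstruction confidence in [0, 1].
    """
    total, weight_sum = 0.0, 0.0
    for s, t, r in zip(student_probs, teacher_probs, recon_conf):
        # Pseudo-label: the class the teacher considers most likely.
        pseudo_label = max(range(len(t)), key=t.__getitem__)
        pred_conf = t[pseudo_label]
        # Assumption: the two confidences are combined by a simple product.
        w = pred_conf * r
        total += -w * math.log(s[pseudo_label] + eps)
        weight_sum += w
    # Assumption: normalize by the total weight; return 0 if nothing is trusted.
    return total / weight_sum if weight_sum > 0 else 0.0
```

With this weighting, an element whose reconstruction confidence is zero contributes nothing to the loss, so unreliable geometry does not push the student toward a wrong pseudo-label.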

Paper Structure (25 sections, 7 equations, 11 figures, 13 tables)

Figures (11)

  • Figure 1: Rewis3d -- Left: Our method (Rewis3d) greatly improves performance for weakly supervised segmentation trained with point and scribble labels. Notably, we improve robustness to changes in object scale and produce more precise class boundaries. Right: We consistently outperform previous SOTA methods by significant margins on a range of datasets and a variety of sparse annotations.
  • Figure 2: Conceptual overview of weakly-supervised segmentation approaches. (a) Traditional methods rely solely on sparse 2D annotations, limiting supervision propagation. (b) Our proposed method Rewis3d introduces a 3D branch, enforcing cross-modal consistency (CMC) between 2D predictions and 3D predictions from reconstructed geometry.
  • Figure 3: Overview of the training pipeline. Our framework operates in two stages. Base Training (blue and green) establishes independent student-teacher setups for each modality using sparse supervision. Cross-Modal Consistency (orange) introduces our core contribution: bidirectional knowledge transfer where the teacher of one modality supervises the student of the other, weighted by our dual confidence mechanism leveraging prediction certainty and reconstruction quality.
  • Figure 4: Sparse label accumulation. Firstly, an image sequence is unprojected to a 3D point cloud via a multi-view reconstruction model. Subsequently, we establish correspondences between the 3D points and the 2D pixels in the source images. This allows for label accumulation in the 3D space, and by projection, also in the 2D images.
  • Figure 5: Qualitative comparison across outdoor and indoor datasets. Rewis3d produces sharper boundaries, more accurate fine-grained predictions, and better long-range segmentation compared to the Mean Teacher baseline (EMA) and TEL, even in regions where 3D reconstruction is uncertain. Colormaps are provided in the appendix.
  • ...and 6 more figures
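
The sparse label accumulation described in the Figure 4 caption can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's pipeline: the function name and data layout are hypothetical, the pixel-to-point correspondences are assumed to be given as precomputed point IDs shared across frames (the reconstruction model itself is not shown), and conflicting labels on the same point are resolved by majority vote, which the source does not specify.

```python
from collections import Counter, defaultdict


def accumulate_labels(sparse_labels, pixel_to_point):
    """Accumulate sparse 2D labels on 3D points, then project them back to 2D.

    sparse_labels:  {frame_id: {(u, v): class_id}}  annotated pixels per frame
    pixel_to_point: {frame_id: {(u, v): point_id}}  pixel/3D-point correspondences
    Returns (point_labels, dense_labels) with the same nesting as the inputs.
    """
    # Step 1: lift every annotated pixel to its 3D point and collect votes.
    votes = defaultdict(Counter)
    for frame, labels in sparse_labels.items():
        for pixel, cls in labels.items():
            point = pixel_to_point.get(frame, {}).get(pixel)
            if point is not None:
                votes[point][cls] += 1

    # Step 2 (assumption): resolve conflicts on a point by majority vote.
    point_labels = {p: c.most_common(1)[0][0] for p, c in votes.items()}

    # Step 3: project accumulated labels back to every frame that sees the point.
    dense_labels = {
        frame: {px: point_labels[pt] for px, pt in mapping.items() if pt in point_labels}
        for frame, mapping in pixel_to_point.items()
    }
    return point_labels, dense_labels


# Usage: a single labeled pixel in frame 0 also labels the pixel in frame 1
# that observes the same 3D point (point 7); point 8 stays unlabeled.
sparse = {0: {(1, 1): 3}}
correspondences = {0: {(1, 1): 7}, 1: {(4, 2): 7, (0, 0): 8}}
point_labels, dense = accumulate_labels(sparse, correspondences)
```

In this toy case, `dense` gains a label for pixel `(4, 2)` in frame 1 even though that frame had no annotation, which is the propagation effect the caption describes.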