Surface-SOS: Self-Supervised Object Segmentation via Neural Surface Representation

Xiaoyun Zheng, Liwei Liao, Jianbo Jiao, Feng Gao, Ronggang Wang

TL;DR

Surface-SOS tackles self-supervised object segmentation from multi-view images by learning a geometrically consistent neural surface representation. It decomposes scenes into foreground and background with two neural modules, FoCoR and BaCo, built on a hash-encoded SDF and optimized via differentiable volume rendering and losses such as $L_{\mathrm{Eikonal}}$ and $L_{\mathrm{sparsity}}$, with optional coarse masks to speed up convergence. The approach yields finer, view-consistent foreground masks and completed backgrounds across forward-facing, object-centric, and real-world dynamic scenes, outperforming NeRF-based methods and supervised single-view baselines while reducing annotation needs. This work advances robust 3D-aware segmentation for a wide range of scenes and setups, and provides code to enable broader adoption.
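
As a rough illustration of the Eikonal term mentioned above, the sketch below uses the standard form of the regularizer, the mean of $(\lVert \nabla f(p) \rVert - 1)^2$ over sampled points, which encourages $f$ to behave like a true signed distance function. The paper's exact formulation and weighting are not given in this summary, so this is an assumption. The analytic sphere SDF stands in for the SDF network, and all function names (`sphere_sdf`, `sdf_gradient`, `eikonal_loss`) are hypothetical.

```python
import numpy as np

def sphere_sdf(p, radius=1.0):
    """Analytic SDF of an origin-centered sphere (stand-in for the SDF network)."""
    return np.linalg.norm(p, axis=-1) - radius

def sdf_gradient(sdf, p, eps=1e-4):
    """Central finite-difference gradient of `sdf` at points p with shape (N, 3)."""
    grad = np.zeros_like(p)
    for axis in range(3):
        offset = np.zeros(3)
        offset[axis] = eps
        grad[:, axis] = (sdf(p + offset) - sdf(p - offset)) / (2.0 * eps)
    return grad

def eikonal_loss(sdf, p):
    """Mean squared deviation of the gradient norm from 1 (assumed standard form)."""
    grad_norm = np.linalg.norm(sdf_gradient(sdf, p), axis=-1)
    return float(np.mean((grad_norm - 1.0) ** 2))

rng = np.random.default_rng(0)
points = rng.uniform(-2.0, 2.0, size=(1024, 3))

# A true SDF has unit gradient norm, so the loss is close to zero.
loss_true = eikonal_loss(sphere_sdf, points)
# Scaling the field by 2 doubles the gradient norm, so the loss is close to 1.
loss_scaled = eikonal_loss(lambda p: 2.0 * sphere_sdf(p), points)
```

In a trained model the gradient would come from automatic differentiation of the network rather than finite differences; the finite-difference version is used here only to keep the example dependency-free.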

Abstract

Self-supervised Object Segmentation (SOS) aims to segment objects without any annotations. With multi-camera inputs, the structural, textural and geometrical consistency across views can be leveraged to achieve fine-grained object segmentation. To make better use of this information, we propose Surface representation based Self-supervised Object Segmentation (Surface-SOS), a new framework that segments objects in each view using a 3D surface representation built from multi-view images of a scene. To model high-quality geometry surfaces for complex scenes, we design a novel scene representation scheme that decomposes the scene into two complementary neural representation modules, each with a Signed Distance Function (SDF). Moreover, Surface-SOS is able to refine single-view segmentation with multi-view unlabeled images by introducing coarse segmentation masks as additional input. To the best of our knowledge, Surface-SOS is the first self-supervised approach that leverages neural surface representation to break the dependence on large amounts of annotated data and strong constraints. These constraints typically involve observing target objects against a static background or relying on temporal supervision in videos. Extensive experiments on standard benchmarks including LLFF, CO3D, BlendedMVS, TUM and several real-world scenes show that Surface-SOS consistently yields finer object masks than its NeRF-based counterparts and substantially surpasses supervised single-view baselines. Code is available at: https://github.com/zhengxyun/Surface-SOS.

Paper Structure (19 sections, 12 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: We present Surface-SOS, in which multi-view geometric constraints are embedded in the form of dense one-to-one mapping in 3D surface representation. Given multi-view images as input, Surface-SOS predicts convincing results including object masks, foregrounds and backgrounds.
  • Figure 2: Method overview. For a scene captured by $N$ images $\{I_i\}_{i=1}^N$, we use COLMAP [colmap] and Mask-RCNN [maskrcnn] to obtain sparse 3D points and coarse object masks as co-inputs, and predict a dense, geometrically consistent object map as well as a textured, completed background for each image. Note that the coarse mask is optional and merely expedites the convergence of the 3D surface representation. Moreover, by introducing coarse masks as additional input, Surface-SOS is able to refine segmentation remarkably (see the under-segmentation and over-segmentation highlighted in red and yellow, respectively). Surface-SOS consists of two complementary representation modules: a Foreground Consistent Representation (FoCoR) module and a Background Completion (BaCo) module. FoCoR: For every image, given a 3D point $p(x,y,z)$, we concatenate its features queried from the multi-resolution hash grid and use them as the input to the SDF network. The SDF network outputs the geometry feature and SDF value, which are combined with the viewing direction and fed into the RGB network to generate the RGB value for the foreground, as well as the alpha prediction $\alpha$. BaCo: Given a sequence of multi-view images, we concatenate the static features queried from the multi-resolution hash grid with the 3D position $p(x,y,z)$ as the input to the SDF network. Here, we crop the foreground from the probability map region $m^P$ by setting the SDF value to a positive number (e.g. 1.0). Then the SDF value $\sigma^B$ and geometry feature vectors $\mathbf{F}_{geo}^B$ are combined with the viewing direction $\mathbf{v} \in \mathbb{S}^2$ and fed into the RGB network $\mathrm{M}_{\mathcal{C}}$ to generate the RGB value for the background $c^B$. After removing the foreground from the probability map $m^P$, even though some parts of the background were occluded in the original view, the other views of the scene provide sufficient textural and structural information to complete the missing background. All parts of the proposed pipeline are trained end-to-end with the geometric and photometric losses in a self-supervised manner, using the original input images.
  • Figure 3: A visualization of the architecture of the FoCoR and BaCo modules.
  • Figure 4: Comparison on the forward-facing scenes Flower, Fortress, and Horns from the LLFF dataset [llff]. In the third column, DINO-CoSeg [dino_coseg] mistakenly matches several discrete patches because DINO has high activation on only a few tokens, which can lead to view-inconsistent and disconnected co-segmentation results. Compared to SAM [segany] and DINO-CoSeg, our results have more accurate edges, since our network can exploit multi-scale geometry features to better capture the matte objects. Compared with NeRF-based methods (i.e., Semantic-NeRF [semanticNeRF], NeRF-SOS [nerfsos], and RFP [rfp]), Surface-SOS (g) produces view-consistent masks with finer details and no holes in the interior of objects.
  • Figure 5: Qualitative comparisons on the object-centric scenes Bicycle and Backpack from the CO3D dataset [co3d]. Although SAM [segany] provides fine-grained boundary information, its output is noisy and misses more valid detections than ours. In contrast, the proposed method produces high-quality, geometrically and texturally consistent foreground maps without introducing noise; e.g., it recovers the complex structure of the bicycle frame and renders detailed textures in the Bicycle example.
  • ...and 7 more figures
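
The BaCo cropping step described in the Figure 2 caption (setting the SDF to a positive value such as 1.0 inside the foreground probability map $m^P$) can be sketched as follows. This is one possible reading of the caption, not the authors' implementation: it assumes the probability map is sampled once per ray and applied to every sample along that ray, and the 0.5 threshold is an arbitrary illustrative choice. The `composite` helper uses the standard alpha compositing equation to combine foreground and background colors; whether the paper blends its outputs this way is also an assumption. All function names are hypothetical.

```python
import numpy as np

def crop_foreground_sdf(sdf, fg_prob, threshold=0.5, outside_value=1.0):
    """Override SDF samples on rays that hit the foreground mask.

    sdf:     (R, S) SDF values for S samples along each of R rays.
    fg_prob: (R,) foreground probability per ray (assumed per-pixel m^P).
    A positive SDF marks a point as outside any surface, so the background
    branch sees empty space where the foreground object used to be.
    """
    is_foreground = fg_prob[:, None] > threshold
    return np.where(is_foreground, outside_value, sdf)

def composite(alpha, fg_rgb, bg_rgb):
    """Standard alpha compositing: alpha * foreground + (1 - alpha) * background."""
    a = alpha[..., None]
    return a * fg_rgb + (1.0 - a) * bg_rgb

# Two rays with three samples each; only the first ray is marked foreground.
sdf = np.array([[-0.2, 0.1, 0.3],
                [0.4, -0.1, 0.2]])
fg_prob = np.array([0.9, 0.1])
cropped = crop_foreground_sdf(sdf, fg_prob)

# Blend a red foreground over a blue background for two pixels.
alpha = np.array([1.0, 0.25])
fg_rgb = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
bg_rgb = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
blended = composite(alpha, fg_rgb, bg_rgb)
```

With the SDF forced positive on foreground rays, the background branch has no surface to render there from that view, which is why the caption relies on the other views to supply the occluded texture and structure.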