Betsu-Betsu: Multi-View Separable 3D Reconstruction of Two Interacting Objects

Suhas Gopal, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt

TL;DR

This work tackles separable 3D reconstruction of two interacting objects from multi-view RGB data. It proposes Betsu-Betsu, a markerless, template-free framework that jointly encodes both objects with a shared multi-resolution hash grid and outputs two SDFs. An alpha-blending regularisation enforces disjoint opacities and prevents interpenetration, which yields clean boundaries and supports novel-view synthesis. The method is validated across diverse interaction scenarios (human-object, hand-object, human-human, object-object) and datasets, including a new human-object dataset, and shows improvements in geometry and appearance over state-of-the-art baselines such as ObjectSDF++ and Segmented NeuS2. The results demonstrate the practicality and generality of a two-object, template-free, compositional reconstruction approach, with potential extensions to larger scenes and template-guided refinements for further accuracy.

Abstract

Separable 3D reconstruction of multiple objects from multi-view RGB images -- resulting in two different 3D shapes for the two objects with a clear separation between them -- remains a sparsely researched problem. It is challenging due to severe mutual occlusions and ambiguities along the objects' interaction boundaries. This paper investigates the setting and introduces a new neuro-implicit method that can reconstruct the geometry and appearance of two objects undergoing close interactions while disjoining both in 3D, avoiding surface inter-penetrations and enabling novel-view synthesis of the observed scene. The framework is end-to-end trainable and supervised using a novel alpha-blending regularisation that ensures that the two geometries are well separated even under extreme occlusions. Our reconstruction method is markerless and can be applied to rigid as well as articulated objects. We introduce a new dataset consisting of close interactions between a human and an object and also evaluate on two scenes of humans performing martial arts. The experiments confirm the effectiveness of our framework and substantial improvements using 3D and novel view synthesis metrics compared to several existing approaches applicable in our setting.

Paper Structure

This paper contains 30 sections, 10 equations, 20 figures, 6 tables.

Figures (20)

  • Figure 1: Our method reconstructs humans and objects in 3D from segmented multi-view (MV) RGB images (top) in a separable way, i.e. with clean boundaries and no inter-penetration. (Bottom:) For each of the three scenes (Sparring, Pikachu and Laptop Demonstration), we show the two jointly recovered geometries (left), individual novel-view renderings (top right) and individual geometries.
  • Figure 2: Schematic overview of our framework. We semantically segment the input multi-view images into the background and the areas corresponding to two interacting objects. The scene is encoded using a shared, multi-resolution hash grid encoding $\mathbf{e}$ and the shared features are decoded using two separate SDF MLPs to produce corresponding SDFs $\Phi_1$ and $\Phi_2$. The per-point colour $\mathcal{C}_s$ is estimated from the joint scene SDF composed using $\Phi_s = \Phi_1 \cup \Phi_2$. Finally, we integrate the colours of the sampled points in the ray by $\alpha$-blending the individual opacities, $\alpha_1$ and $\alpha_2$, ensuring clean separation boundaries between the two (see \ref{eq:color_compositing}). The entire framework is supervised using the rendering loss and additional regularisers (see \ref{eq:total_loss}).
  • Figure 3: Qualitative comparison of the reconstructed geometry. In most scenes, we obtain better geometry, with fewer deformations near the contact regions. Best viewed when zoomed.
  • Figure 4: Qualitative comparison of 3D scene reconstructions with human-human interaction along with selected multi-view (MV) input images. Digital zoom recommended.
  • Figure 5: Qualitative comparison of reconstruction of scenes involving two objects in proximity, along with samples from the multi-view (MV) input images. Digital zoom recommended.
  • ...and 15 more figures
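
The composition described in the Figure 2 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation. The union of two SDFs as a pointwise minimum and front-to-back alpha compositing are standard techniques; everything else is an assumption made for illustration: the logistic SDF-to-opacity mapping (`sdf_to_alpha`, with its `sharpness` parameter), the rule for merging $\alpha_1$ and $\alpha_2$ at a sample (`combined_alpha`), and the product-based `overlap_penalty`, which only stands in for the idea of encouraging disjoint opacities and is not the paper's regulariser.

```python
import numpy as np

def sdf_union(phi1, phi2):
    """Union of two signed distance fields: the pointwise minimum."""
    return np.minimum(phi1, phi2)

def sdf_to_alpha(phi, sharpness=50.0):
    """ASSUMPTION: logistic mapping from SDF value to opacity.
    Inside the surface (phi < 0) gives alpha near 1, outside gives alpha near 0."""
    x = np.clip(sharpness * np.asarray(phi, dtype=float), -60.0, 60.0)
    return 1.0 / (1.0 + np.exp(x))

def combined_alpha(alpha1, alpha2):
    """ASSUMPTION: a sample is opaque if either object is opaque there."""
    return 1.0 - (1.0 - alpha1) * (1.0 - alpha2)

def composite(alpha, colour):
    """Standard front-to-back alpha compositing along one ray.
    alpha: (N,) opacities, colour: (N, 3) per-sample colours."""
    # Transmittance before each sample: product of (1 - alpha) of earlier samples.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = trans * alpha
    return weights @ colour, weights

def overlap_penalty(alpha1, alpha2):
    """ASSUMPTION: illustrative penalty that is zero when the two
    opacities never overlap at the same sample."""
    return float(np.mean(alpha1 * alpha2))

if __name__ == "__main__":
    # Hypothetical ray through two unit-radius spheres centred at t=2 and t=5.
    t = np.linspace(0.0, 8.0, 64)
    phi1 = np.abs(t - 2.0) - 1.0
    phi2 = np.abs(t - 5.0) - 1.0
    a1, a2 = sdf_to_alpha(phi1), sdf_to_alpha(phi2)
    colours = np.where(phi1[:, None] < phi2[:, None],
                       np.array([1.0, 0.0, 0.0]),   # object 1 drawn in red
                       np.array([0.0, 0.0, 1.0]))   # object 2 drawn in blue
    rgb, _ = composite(combined_alpha(a1, a2), colours)
    print("pixel colour:", rgb)
    print("overlap penalty:", overlap_penalty(a1, a2))
```

In this toy setup the two spheres do not intersect, so the penalty stays close to zero; moving the centres closer together makes the opacities overlap and the penalty grow, which is the behaviour a separation regulariser would discourage.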