Betsu-Betsu: Multi-View Separable 3D Reconstruction of Two Interacting Objects
Suhas Gopal, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt
TL;DR
This work tackles separable 3D reconstruction of two interacting objects from multi-view RGB data. It proposes Betsu-Betsu, a markerless, template-free framework that jointly encodes both objects with a shared multi-resolution hashgrid and outputs two signed distance functions (SDFs), one per object. An alpha-blending regularisation enforces disjoint opacities and prevents interpenetration, which yields clean object boundaries and supports novel-view synthesis. The method is validated across diverse interaction scenarios (human-object, hand-object, human-human, object-object) and datasets, including a new human-object dataset, and improves geometry and appearance over state-of-the-art baselines such as ObjectSDF++ and Segmented NeuS2. The results demonstrate the practicality and generality of a two-object, template-free, compositional reconstruction approach, with potential extensions to larger scenes and template-guided refinements for further accuracy.
Abstract
Separable 3D reconstruction of multiple objects from multi-view RGB images -- resulting in two different 3D shapes for the two objects with a clear separation between them -- remains a sparsely researched problem. It is challenging due to severe mutual occlusions and ambiguities along the objects' interaction boundaries. This paper investigates the setting and introduces a new neuro-implicit method that reconstructs the geometry and appearance of two objects undergoing close interactions while keeping them disjoint in 3D, avoiding surface inter-penetrations and enabling novel-view synthesis of the observed scene. The framework is end-to-end trainable and supervised using a novel alpha-blending regularisation that keeps the two geometries well separated even under extreme occlusions. Our reconstruction method is markerless and can be applied to rigid as well as articulated objects. We introduce a new dataset consisting of close interactions between a human and an object and also evaluate on two scenes of humans performing martial arts. The experiments confirm the effectiveness of our framework and show substantial improvements on 3D and novel-view synthesis metrics compared to several existing approaches applicable in our setting.
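The idea of keeping the two objects' opacities disjoint can be illustrated with a toy example. The sketch below is not the paper's loss, whose exact form is not given in this summary. It assumes a sigmoid mapping from signed distance to opacity and penalises the mean product of the two opacities, so the penalty grows wherever both objects claim the same point in space. The function names, the `sharpness` parameter, and the two analytic spheres standing in for learned SDFs are all hypothetical.

```python
import numpy as np


def sphere_sdf(points, centre, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - centre, axis=-1) - radius


def sdf_to_opacity(sdf, sharpness=50.0):
    """Assumed mapping: sigmoid(-sharpness * sdf), close to 1 inside, 0 outside.

    Written with tanh so large |sdf| values do not overflow.
    """
    return 0.5 * (1.0 - np.tanh(0.5 * sharpness * sdf))


def overlap_penalty(sdf_a, sdf_b, sharpness=50.0):
    """Hypothetical disjointness term: mean product of the two opacities."""
    alpha_a = sdf_to_opacity(sdf_a, sharpness)
    alpha_b = sdf_to_opacity(sdf_b, sharpness)
    return float(np.mean(alpha_a * alpha_b))


# Sample points in a cube that contains both toy objects.
rng = np.random.default_rng(0)
points = rng.uniform(-1.5, 1.5, size=(20000, 3))


def two_sphere_penalty(offset, radius=0.5):
    """Penalty for two spheres centred at (-offset, 0, 0) and (+offset, 0, 0)."""
    sdf_a = sphere_sdf(points, np.array([-offset, 0.0, 0.0]), radius)
    sdf_b = sphere_sdf(points, np.array([offset, 0.0, 0.0]), radius)
    return overlap_penalty(sdf_a, sdf_b)


pen_separated = two_sphere_penalty(0.6)    # centres 1.2 apart: no overlap
pen_overlapping = two_sphere_penalty(0.3)  # centres 0.6 apart: interpenetrating

print(f"separated:   {pen_separated:.2e}")
print(f"overlapping: {pen_overlapping:.2e}")
```

In this toy setup the penalty is close to zero for the separated spheres and clearly larger for the interpenetrating pair, which is the behaviour a disjointness regulariser needs in order to push two learned surfaces apart during training.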
