Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments

Liyuan Zhu, Shengyu Huang, Konrad Schindler, Iro Armeni

TL;DR

This work addresses long-term dynamic 3D scene understanding from sparse temporal observations by proposing MoRE$^2$, a unified framework for living scenes. It introduces a compact SE(3)-equivariant encoder-decoder built from a Vector Neurons encoder and a neural implicit DeepSDF decoder. This single network supports instance matching, registration, and reconstruction, and a joint optimization accumulates per-instance point clouds over time. The approach is trained on synthetic data and validated on both the synthetic FlyingShape dataset and the real 3RScan dataset, achieving state-of-the-art end-to-end performance as well as improved results on the matching, registration, and reconstruction subtasks. The framework advances the concept of living scenes by progressively refining geometric completeness and pose accuracy as more temporal data becomes available, with potential applications in robotics, AR/VR, and digital twins.

Abstract

Research into dynamic 3D scene understanding has primarily focused on short-term change tracking from dense observations, while little attention has been paid to long-term changes with sparse observations. We address this gap with MoRE$^2$, a novel approach for multi-object relocalization and reconstruction in evolving environments. We view these environments as "living scenes" and consider the problem of transforming scans taken at different points in time into a 3D reconstruction of the object instances, whose accuracy and completeness increase over time. At the core of our method lies an SE(3)-equivariant representation in a single encoder-decoder network, trained on synthetic data. This representation enables us to seamlessly tackle instance matching, registration, and reconstruction. We also introduce a joint optimization algorithm that facilitates the accumulation of point clouds originating from the same instance across multiple scans taken at different points in time. We validate our method on synthetic and real-world data and demonstrate state-of-the-art results both end-to-end and on the individual subtasks.
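To make the notion of an equivariant representation concrete, the following is a minimal numerical sketch, not the authors' network. It assumes a toy "encoder" that centers the point cloud and applies a single Vector-Neuron-style linear layer (a channel-mixing matrix that never touches the 3D axis). The names `random_rotation`, `vn_linear`, and `encode`, as well as all sizes, are hypothetical and chosen only for illustration.

```python
import numpy as np

def random_rotation(rng):
    # QR decomposition of a Gaussian matrix yields an orthogonal matrix;
    # flip one column if needed so that det = +1 (a proper rotation).
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0
    return q

def vn_linear(features, weight):
    # features: (C, 3) vector-valued channels; weight: (C_out, C).
    # Only channels are mixed, so a rotation applied to the 3D axis
    # commutes with this layer.
    return weight @ features

rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))      # toy point cloud, one point per row
weight = rng.normal(size=(8, 128))      # hypothetical layer weights
rotation = random_rotation(rng)
translation = rng.normal(size=3)

def encode(pts):
    # Toy stand-in for an encoder: centering removes the translation,
    # the linear layer keeps rotation equivariance.
    centered = pts - pts.mean(axis=0)
    return vn_linear(centered, weight)  # shape (8, 3)

z = encode(points)
z_moved = encode(points @ rotation.T + translation)

# Equivariance: moving the input rotates the embedding by the same rotation.
equivariant = np.allclose(z_moved, z @ rotation.T)
# Invariance: the Gram matrix of the embedding ignores the pose entirely.
invariant = np.allclose(z_moved @ z_moved.T, z @ z.T)
print(equivariant, invariant)
```

In this sketch the pose-dependent embedding is the kind of quantity one could use for registration, while the pose-independent Gram matrix is the kind of quantity one could use for matching; how MoRE$^2$ actually derives its matching and registration features is described in the paper itself.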

Paper Structure (71 sections, 25 equations, 17 figures, 10 tables, 2 algorithms)


Figures (17)

  • Figure 1: Living Scenes. A living scene is a 3D environment with multiple moving objects that evolves over time. (a) Two temporal observations (scans) represent the scene at times $(t_1, t_2)$ and capture the objects having moved around. To understand the change in the scene, given instance segmentation, we (b) match object point clouds from $t_1$ and $t_2$ that belong to the same instance; (c) register and reconstruct the matches through our joint optimization; and (d) accumulate all point clouds per instance from the multiple temporal scans, improving the registration and reconstruction quality over time. We illustrate on two scans for simplicity.
  • Figure 2: Overview of the MoRE$^2$ pipeline. Given two temporal point clouds with instance masks $\{\mathbf{X}^{t_0}_i\}_{i=1}^3$ and $\{\mathbf{X}^{t_1}_i\}_{i=1}^3$, we first use the VN encoder to compute the embeddings for each instance. a) Matching solves the pairwise correspondences of the same instances using Hungarian matching (Munkres, 1957) on the embeddings. b) Registration estimates 6DoF transformations within matched pairs: the Kabsch algorithm (Kabsch, 1976) is employed to compute the initial transform, followed by optimization to further refine the registration. c) Joint optimization simultaneously refines the registration and d) reconstruction. The output is the signed distance function (SDF) values of query coordinates.
  • Figure 3: End-to-end cumulative reconstruction with multiple scans. $t_1$, $t_2$, and $t_3$ denote the same scene captured at three times. Point clouds from $t_2$ and $t_3$ are accumulated to $t_1$. Interestingly, chairs in $t_3$ (top) are removed from the scene, but MoRE$^2$ is able to handle this.
  • Figure 4: Multi-object relocalization on 3RScan (Wald et al., 2019). Instances, uniquely colored in the source scan, are matched and registered to their corresponding instances in the target scan, as per ground truth. Arrows ($\searrow$) highlight differences between methods in registration and in matching.
  • Figure 5: Multi-object matching on 3RScan (Wald et al., 2019). We repaint the instances in the source scan using the same colors as those of the matched instances in the target scan. X denotes wrongly matched instances. Curves depict the associations of moving objects. (5/7) denotes 5 correct matches out of 7 pairs in the scene.
  • ...and 12 more figures
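
The matching and initial-registration steps named in the Figure 2 caption can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: Hungarian matching is run on a Euclidean-distance cost between per-instance embeddings (the cost function is an assumption), and Kabsch is applied to two point sets with known one-to-one correspondences, whereas the paper computes the initial transform from its learned embeddings. The function names `match_instances` and `kabsch` are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(emb_src, emb_tgt):
    """Hungarian matching on pairwise embedding distances.

    emb_src: (N, D) and emb_tgt: (M, D) per-instance embeddings.
    Returns a list of (source_index, target_index) pairs.
    """
    # Assumed cost: Euclidean distance between embeddings.
    cost = np.linalg.norm(emb_src[:, None, :] - emb_tgt[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # minimizes the total cost
    return list(zip(rows.tolist(), cols.tolist()))

def kabsch(src, tgt):
    """Least-squares rigid transform between corresponding point sets.

    src, tgt: (N, 3) arrays where src[i] corresponds to tgt[i].
    Returns (R, t) such that tgt is approximately src @ R.T + t.
    """
    src_centroid = src.mean(axis=0)
    tgt_centroid = tgt.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_centroid).T @ (tgt - tgt_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Correct a possible reflection so that det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t
```

A typical use would call `match_instances` once per scan pair and then `kabsch` once per matched pair to obtain an initial transform, which a subsequent optimization could refine.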