JRM: Joint Reconstruction Model for Multiple Objects without Alignment

Qirui Wu, Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Richard Newcombe, Angel X. Chang, Jakob Engel, Henry Howard-Jenkins

Abstract

Object-centric reconstruction seeks to recover the 3D structure of a scene through the composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition, where the same object model is seen multiple times in a scene or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as a problem of personalized generation: multiple observations share a common subject that should be consistent across all observations, while still adhering to the specific pose and state of each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned observations in its latent space, learning to produce consistent and faithful reconstructions in a data-driven manner without explicit constraints. Evaluations on synthetic and real-world data show that JRM's implicit aggregation removes the need for explicit alignment, improves robustness to incorrect associations, and naturally handles non-rigid changes such as articulation. Overall, JRM outperforms both independent and alignment-based baselines in reconstruction quality.
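The abstract names the core mechanism, a flow-matching generative model over object latents conditioned jointly on multiple unaligned observations, but this excerpt does not give the architecture or training details. The snippet below is therefore only a minimal sketch of a standard conditional flow-matching training step with a linear interpolation path; the denoiser `JointDenoiser`, its layer layout, the token shapes, and the way observation features are concatenated are all illustrative assumptions, not the paper's actual model.

    import torch
    import torch.nn as nn

    class JointDenoiser(nn.Module):
        """Hypothetical stand-in for a latent denoiser over K jointly
        reconstructed objects: maps noisy latent tokens, a time value,
        and per-object observation features to a predicted velocity."""
        def __init__(self, dim=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 512), nn.GELU(),
                                     nn.Linear(512, dim))

        def forward(self, x_t, t, obs):
            # x_t, obs: (B, K, N, dim); t: (B,)
            t = t.view(-1, 1, 1, 1).expand(*x_t.shape[:-1], 1)
            return self.net(torch.cat([x_t, obs, t], dim=-1))

    def flow_matching_step(model, x1, obs):
        """One conditional flow-matching training step (linear path).
        x1:  clean latent tokens for K jointly reconstructed objects, (B, K, N, D)
        obs: encoded, unaligned observation features of the same shape."""
        x0 = torch.randn_like(x1)                      # noise sample
        t = torch.rand(x1.shape[0], device=x1.device)  # per-sample time in [0, 1]
        tb = t.view(-1, 1, 1, 1)
        x_t = (1 - tb) * x0 + tb * x1                  # linear interpolation path
        v_target = x1 - x0                             # constant target velocity
        v_pred = model(x_t, t, obs)
        return ((v_pred - v_target) ** 2).mean()       # regress the velocity field

    # Usage with toy shapes: 2 scenes, 2 object instances, 64 tokens, 256-d
    model = JointDenoiser(dim=256)
    x1 = torch.randn(2, 2, 64, 256)
    obs = torch.randn(2, 2, 64, 256)
    loss = flow_matching_step(model, x1, obs)
    loss.backward()

Because the joint conditioning enters only through the latent tokens passed to the denoiser, this formulation does not require the observations of the two instances to be aligned or explicitly matched.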

Figures (19)

  • Figure 1: We address the challenge of compositional scene reconstruction with objects re-observed across space and time. We characterize this challenge through three concrete cases (left): spatial repetition, temporal repetition, and articulation dynamics. We propose the Joint Reconstruction Model (JRM) to perform coupled reconstruction of a group of objects, outperforming reconstruction of each object individually (right).
  • Figure 2: Comparison between different approaches to object-centric reconstruction. JRM offers a relaxation of explicit alignment and registration techniques. Objects are jointly reconstructed, allowing information flow between them, but without imposing hard constraints on similarity.
  • Figure 3: (a) JRM jointly reconstructs two nightstands $k$ and $k'$ that appear in a single scan. (b) Our proposed coupled fusion block, in which the denoised tokens of two distinct objects attend to each other in the latent space to implicitly aggregate unaligned observations (a rough sketch of such a block follows this figure list).
  • Figure 4: An example JRM reconstruction illustrating the benefit of joint reconstruction. From the first view alone, it is not possible to discern the shape of the armrest. However, by adding a source observation that views the object from the front, JRM is able to reconstruct the target view correctly.
  • Figure 5: Qualitative results on temporal instance repetition. Objects are colored from faded to solid tones according to their Chamfer distance to the ground truth. Reconstructions from JRM improve with more rescans, while those from FM degrade due to inaccurate alignments.
  • ...and 14 more figures
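Figure 3(b) describes the coupled fusion block, in which the denoised tokens of two distinct objects attend to each other in latent space. The exact layer layout is not given in this excerpt, so the following is a rough sketch under assumptions: standard multi-head cross-attention with pre-norm residual connections, a symmetric exchange between the two instances, and illustrative names such as `CoupledFusionBlock`.

    import torch
    import torch.nn as nn

    class CoupledFusionBlock(nn.Module):
        """Illustrative coupled fusion block: tokens of object k attend to the
        (unaligned) tokens of object k', and vice versa, so information flows
        between jointly reconstructed instances without explicit alignment."""
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm_q = nn.LayerNorm(dim)
            self.norm_kv = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))
            self.norm_mlp = nn.LayerNorm(dim)

        def fuse(self, x_q, x_kv):
            # Query tokens of one object attend to the other object's tokens.
            kv = self.norm_kv(x_kv)
            attn_out, _ = self.cross_attn(self.norm_q(x_q), kv, kv)
            x = x_q + attn_out                     # residual cross-attention
            return x + self.mlp(self.norm_mlp(x))  # residual feed-forward

        def forward(self, x_k, x_kp):
            # Symmetric exchange between the two instances' latent tokens.
            return self.fuse(x_k, x_kp), self.fuse(x_kp, x_k)

    # Usage: two instances of the same object, 64 latent tokens each, 256-d
    block = CoupledFusionBlock(dim=256, heads=8)
    tokens_k = torch.randn(1, 64, 256)
    tokens_kp = torch.randn(1, 64, 256)
    fused_k, fused_kp = block(tokens_k, tokens_kp)

In this sketch the coupling is soft: attention lets each instance borrow evidence from the other where it is useful, rather than imposing a hard constraint that the two reconstructions be identical or rigidly registered.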