Inter3D: A Benchmark and Strong Baseline for Human-Interactive 3D Object Reconstruction

Gan Chen, Ying He, Mulin Yu, F. Richard Yu, Gang Xu, Fei Ma, Ming Li, Guang Zhou

TL;DR

Inter3D tackles the intractable problem of modeling human-interactive objects with $n$ movable parts, where there are $2^n$ discrete states. It introduces a benchmark with a self-collected dataset and a novel evaluation protocol restricting training to canonical and individual-part states, while unseen combination states are tested, and presents a baseline method that combines Space Discrepancy Tensors with multi-resolution hash encoding via InstantNGP. The approach employs a Mutual State Regularization to maintain cross-state consistency and offers two occupancy-grid strategies to balance training speed and memory usage. Experimental results on four object categories show strong performance on novel state synthesis and clear advantages over existing static/dynamic 3D methods. These contributions provide a practical foundation for scalable, interactive 3D object reconstruction and synthesis.

Abstract

Recent advancements in implicit 3D reconstruction methods, e.g., neural rendering fields and Gaussian splatting, have primarily focused on novel view synthesis of static or dynamic objects with continuous motion states. However, these approaches struggle to efficiently model a human-interactive object with $n$ movable parts, requiring $2^n$ separate models to represent all discrete states. To overcome this limitation, we propose Inter3D, a new benchmark and approach for novel state synthesis of human-interactive objects. We introduce a self-collected dataset featuring commonly encountered interactive objects and a new evaluation pipeline, where only individual part states are observed during training, while part combination states remain unseen. We also propose a strong baseline approach that leverages Space Discrepancy Tensors to efficiently model all states of an object. To alleviate the impractical constraints on camera trajectories across training states, we propose a Mutual State Regularization mechanism to enhance the spatial density consistency of movable parts. In addition, we explore two occupancy grid sampling strategies to improve training efficiency. We conduct extensive experiments on the proposed benchmark, showcasing the challenges of the task and the superiority of our approach.
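
As a rough illustration of the evaluation split described above, the sketch below enumerates the $2^n$ open/closed states of an object and separates them into training states (canonical plus single-part states) and unseen combination states. The function name `split_states` and the 0/1 tuple encoding are assumptions made for this example and do not come from the paper.

```python
from itertools import product


def split_states(n):
    """Split the 2**n open/closed states of n movable parts into train and test.

    Each state is a tuple of n flags (hypothetical encoding):
    1 means the part is open, 0 means it is closed.
    """
    states = list(product((0, 1), repeat=n))
    # Canonical state (all closed) plus the n states with exactly one part open.
    train = [s for s in states if sum(s) <= 1]
    # Combination states with two or more parts open, never seen in training.
    test = [s for s in states if sum(s) >= 2]
    return train, test


train, test = split_states(3)
print(len(train), len(test))  # 4 4
```

For $n=3$ this gives $n+1=4$ observed states and $2^n-n-1=4$ held-out combinations; the held-out share grows quickly with $n$, since the training set only grows linearly.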

Paper Structure

This paper contains 18 sections, 10 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Illustration of reconstructing a human-interactive object, Furniture, in our Inter3D. The object consists of $n=3$ movable parts, highlighted in red, green, and olive-green. With $2^n=8$ discrete states, the task requires significant computational and memory resources, making it impractical for existing methods to train separate models for each state. Furthermore, ensuring consistency in external appearances and internal structures across states poses a significant challenge when states are trained independently. In contrast, our approach efficiently synthesizes novel combination states by observing only the canonical and individual part states.
  • Figure 2: Overview of our approach, which comprises three key stages, i.e., Canonical Modelling, Movable Part Decomposition, and Arbitrary Combination Synthesis. The inactive movable parts in each state are covered by colored masks. In Canonical Modelling, the canonical state $S_0$ of the interactive object, with all $n$ movable parts closed, is reconstructed using InstantNGP. The spatial sample point features are retrieved via multi-resolution hash encoding, represented as ${\mathbf{h}}$, with attributes color $c_0$ and density $\sigma_0$ projected through a multi-layer perceptron (MLP). In Movable Part Decomposition, each movable part is manipulated sequentially, resulting in states where only one part is open, denoted as $S_i, i\in\{1,...,n\}$. For each $S_i$, the sample point features are represented as ${\mathbf{h}} * {\mathbf{t}}_i$, where ${\mathbf{t}}_i$ is derived from the proposed Space Discrepancy Tensors, encoding the differences between $S_i$ and $S_0$. To address camera trajectory perturbations across states, we introduce a Mutual State Regularization mechanism, which mitigates rendering artifacts caused by movable part misalignment. In Arbitrary Combination Synthesis, to render the combined novel state $S_i+S_j$, the sample point features with the maximum density difference, selected from $\{|\sigma_i-\sigma_0|, |\sigma_j-\sigma_0|\}$, are used for volumetric rendering, which enables the efficient modelling of arbitrary combinations of movable parts.
  • Figure 3: Data collection example of the object Car in our human-interactive benchmark Inter3D. A sequence of forward-facing images is captured for the canonical state $S_0$ with doors closed, the state $S_1$ with the front door open, and the state $S_2$ with the rear door open.
  • Figure 4: Illustration of synthesizing the novel combination state on the Car object. Given the features of the same sample point across the training states, its optimal feature on the novel state is determined by comparing the individual state $S_i$ with the canonical state $S_0$. The inactive movable parts in each state are covered by colored masks.
  • Figure 5: Illustration of our Mutual State Regularization (MSR) mechanism on the Car object. MSR regularizes the sampling point features of the inactive movable part in the individual state $S_i$ with the canonical state $S_0$ through pixel alignment, and vice versa. An $\mathcal{L}_1$ loss is applied to ensure consistency.
  • ...and 7 more figures
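
The selection rule described in the Figure 2 caption for Arbitrary Combination Synthesis can be sketched for a single sample point as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `combine_sample` is hypothetical, and each state is represented directly by a decoded `(density, color)` pair rather than by the underlying features ${\mathbf{h}} * {\mathbf{t}}_i$.

```python
def combine_sample(canonical, part_states):
    """Choose the attributes of one sample point for a combined state.

    canonical:   (sigma_0, c_0), the point's density and color in state S_0.
    part_states: list of (sigma_i, c_i), the same point in each single-part
                 state S_i that takes part in the combination.

    Returns the pair whose density deviates most from the canonical density,
    i.e. the argmax over |sigma_i - sigma_0|. Ties resolve to the first entry.
    """
    sigma_0, _ = canonical
    return max(part_states, key=lambda sc: abs(sc[0] - sigma_0))


# Hypothetical point lying where the closed front door sits on the Car object.
s0 = (9.0, (0.2, 0.2, 0.2))   # S_0: door closed, point is occupied
s1 = (0.1, (0.0, 0.0, 0.0))   # S_1: front door open, point is empty
s2 = (9.0, (0.2, 0.2, 0.2))   # S_2: rear door open, front door still closed
print(combine_sample(s0, [s1, s2]))  # (0.1, (0.0, 0.0, 0.0))
```

In this toy case the point takes its attributes from $S_1$, the only state in which it changed relative to $S_0$, which matches the intuition that each movable part should be taken from the state in which it was actually moved.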