Free-Moving Object Reconstruction and Pose Estimation with Virtual Camera

Haixin Shi, Yinlin Hu, Daniel Koguciuk, Juan-Ting Lin, Mathieu Salzmann, David Ferstl

TL;DR

The paper addresses reconstructing free-moving objects and estimating their pose from monocular RGB video without relying on scene, hand-pose, or object-category priors and without splitting the sequence into segments. It introduces a virtual camera that always points near the object center, which shrinks the pose search space and enables globally consistent joint optimization of shape and pose using an implicit neural surface representation learned from the video through volume rendering. Progressive training over the whole sequence, followed by a real-camera refinement step (PnP with RANSAC), provides robust initialization and accurate final results. Evaluations on HO3D and egocentric RGB sequences show significant improvements over prior pose-free methods and competitiveness with methods that use hand or object priors, broadening applicability to AR/VR and robotics.

Abstract

We propose an approach for reconstructing a free-moving object from a monocular RGB video. Most existing methods either assume a scene prior, a hand pose prior, or an object category pose prior, or rely on local optimization over multiple sequence segments. We propose a method that allows free interaction with the object in front of a moving camera without relying on any prior, and that optimizes the sequence globally without any segments. We progressively optimize the object shape and pose simultaneously based on an implicit neural representation. A key aspect of our method is a virtual camera system that significantly reduces the search space of the optimization. We evaluate our method on the standard HO3D dataset and on a collection of egocentric RGB sequences captured with a head-mounted device. We demonstrate that our approach significantly outperforms most methods and is on par with recent techniques that assume prior information.

Paper Structure

This paper contains 15 sections, 6 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: Joint object reconstruction and pose estimation from a monocular RGB video. (a) Top: the standard reconstruction setting, with a static object captured by a moving camera, which relies on geometric cues from the whole scene and is inapplicable to dynamic objects. Middle: the setting in which a hand-held object is rotated in front of a fixed camera. Bottom: the setting that does not assume any prior, which allows the object to be moved freely with any grasping style. (b, c) Our method outperforms the state of the art on the HO3D dataset with a fixed camera, and produces accurate results on free-moving objects in egocentric views.
  • Figure 2: Paradigms of pose-free object reconstruction. Top: The input sequence. Middle: The existing method of Hampali et al. [meta-obj] relies on segment-wise joint optimization over multiple easy segments of the sequence, shown with different colors in the pose trajectories, which tends to converge to local optima. Bottom: Our method optimizes object shape and pose progressively without any segments, producing globally consistent shape and pose results.
  • Figure 3: Overview of our method. We first use off-the-shelf 2D segmentation methods to obtain an object mask in each frame, and then optimize the MLP networks w.r.t. a virtual camera system, in which the camera always points to areas near the object center, as illustrated by the colored 3D axes in the figure. We optimize three MLPs with progressively added images. For each frame with time index $t$, we use the Pose MLP to predict the object pose $(\mathbf{R}, s)$, which corresponds to the rotation and the distance from the camera center to the object center, for a total of only 4 degrees of freedom. For each 3D point $\mathbf{x}$ along the view direction $\mathbf{v}$, we use the SDF MLP and Color MLP to predict its corresponding SDF value and color opacity, respectively. We compare the rendered image with the input and update the MLP networks based on volume rendering. We finally conduct the virtual-to-real conversion and refine all the results w.r.t. the real camera.
  • Figure 4: Different methods for joint pose and shape optimization. (a) BARF [barf] struggles to handle 360-degree sequences. (b) The segment-wise optimization of Hampali et al. [meta-obj] is only locally optimal and suffers in this scenario with large pose changes. (c) Our method produces globally consistent results. We visualize the ground truth pose and the predicted pose in cyan and purple, respectively.
  • Figure 5: Effect of the virtual camera. The top row shows the trajectory of the object w.r.t. the real camera and the virtual camera, respectively. The bottom row shows the heatmap of 2D reprojections of the 3D object center across the whole HO3D dataset w.r.t. the different camera systems. The poses w.r.t. the virtual camera have no significant magnitude in either the horizontal or the vertical direction, which allows them to be approximately captured by only 4 degrees of freedom (3 for rotation and 1 for distance).
  • ...and 7 more figures
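
The captions of Figures 3 and 5 describe a virtual camera that always points near the object center, so that the pose reduces to a rotation $\mathbf{R}$ and a distance $s$, followed by a virtual-to-real conversion. The sketch below illustrates one way this geometry could work; it is not the authors' implementation. It assumes a pinhole camera, assumes the virtual camera shares the real camera's center, and assumes the pixel it points at is something like the mask centroid. The names `look_at_rotation`, `virtual_to_real`, and `project`, and the intrinsics used, are hypothetical.

```python
import math


def look_at_rotation(u, v, fx, fy, cx, cy):
    """Rotation (virtual -> real) turning the optical axis toward pixel (u, v).

    Assumption: pinhole intrinsics (fx, fy, cx, cy); the virtual camera keeps
    the real camera's center and only rotates.
    """
    x, y = (u - cx) / fx, (v - cy) / fy
    n = math.sqrt(x * x + y * y + 1.0)
    dx, dy, dz = x / n, y / n, 1.0 / n  # unit ray through the pixel
    k = 1.0 / (1.0 + dz)  # dz > 0 for every pixel on the image plane
    # Closed-form Rodrigues rotation that maps the z-axis onto (dx, dy, dz).
    return [
        [1.0 - k * dx * dx, -k * dx * dy, dx],
        [-k * dx * dy, 1.0 - k * dy * dy, dy],
        [-dx, -dy, dz],
    ]


def matmul(a, b):
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, x):
    """3x3 matrix times 3-vector."""
    return [sum(a[i][k] * x[k] for k in range(3)) for i in range(3)]


def virtual_to_real(R_virtual, s, R_v2r):
    """Convert a 4-DoF virtual-camera pose (R, s) to a 6-DoF real-camera pose.

    In the virtual camera the object center is assumed to lie on the optical
    axis at distance s, i.e. at translation (0, 0, s).
    """
    R_real = matmul(R_v2r, R_virtual)
    t_real = matvec(R_v2r, [0.0, 0.0, s])
    return R_real, t_real


def project(p, fx, fy, cx, cy):
    """Pinhole projection of a 3D point given in real-camera coordinates."""
    return fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy


# Hypothetical intrinsics and mask centroid, for illustration only.
fx = fy = 500.0
cx, cy = 320.0, 240.0
R_v2r = look_at_rotation(420.0, 200.0, fx, fy, cx, cy)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_real, t_real = virtual_to_real(identity, 0.6, R_v2r)
u, v = project(t_real, fx, fy, cx, cy)
print(u, v, t_real)
```

Under these assumptions, the converted object center reprojects onto the pixel the virtual camera was aimed at, which is why only the distance $s$ remains as a translational unknown. A real pipeline would still need the refinement w.r.t. the real camera mentioned in Figure 3, since the object center rarely sits exactly on the virtual optical axis.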