Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding

Mingxuan Wu, Huang Huang, Justin Kerr, Chung Min Kim, Anthony Zhang, Brent Yi, Angjoo Kanazawa

TL;DR

POD addresses the challenge of reconstructing 4D object configurations from long-form monocular video, where depth ambiguity and occlusion hinder both purely predictive and purely optimization-based methods. It proposes a self-improving cycle, Predict-Optimize-Distill, that trains a frame-level predictor conditioned on RGB, refines its predictions with global optimization via inverse rendering, and then distills the optimized trajectories back into the predictor. The approach achieves substantial gains over a pure optimization baseline across 14 real and 5 synthetic objects, and its performance scales with video length and additional cycle iterations. This Real-to-Sim-to-Real framework enables robust 4D understanding of articulated objects from casual monocular video, with implications for interactive manipulation and robotics.

Abstract

Humans can resort to long-form inspection to build intuition on predicting the 3D configurations of unseen objects. The more we observe the object motion, the better we get at predicting its 3D state immediately. Existing systems either optimize underlying representations from multi-view observations or train a feed-forward predictor from supervised datasets. We introduce Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle to achieve better 4D object understanding with increasing observation time. Given a multi-view object scan and a long-form monocular video of human-object interaction, POD iteratively trains a neural network to predict local part poses from RGB frames, uses this predictor to initialize a global optimization which refines output poses through inverse rendering, then finally distills the results of optimization back into the model by generating synthetic self-labeled training data from novel viewpoints. Each iteration improves both the predictive model and the optimized motion trajectory, creating a virtuous cycle that bootstraps its own training data to learn about the pose configurations of an object. We also introduce a quasi-multiview mining strategy for reducing depth ambiguity by leveraging long video. We evaluate POD on 14 real-world and 5 synthetic objects with various joint types, including revolute and prismatic joints as well as multi-body configurations where parts detach or reattach independently. POD demonstrates significant improvement over a pure optimization baseline which gets stuck in local minima, particularly for longer videos. We also find that POD's performance improves with both video length and successive iterations of the self-improving cycle, highlighting its ability to scale performance with additional observations and looped refinement.
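
The alternation described in the abstract can be illustrated with a deliberately tiny control-flow sketch. This is not the authors' implementation: the "pose" is a single scalar, `render` is a hypothetical identity renderer, `optimize` takes one gradient step on a squared photometric-style loss, and `distill` fits a one-parameter linear predictor to self-labeled synthetic pairs. All function names and the toy data are assumptions made for illustration only.

```python
from typing import Callable, List, Tuple

Predictor = Callable[[float], float]


def render(pose: float) -> float:
    # Hypothetical identity "renderer": the observation equals the pose.
    return pose


def optimize(frames: List[float], init: List[float], step: float = 0.5) -> List[float]:
    # One gradient step per frame on the loss 0.5 * (render(pose) - frame) ** 2,
    # starting from the predictor's initialization.
    return [p - step * (render(p) - f) for p, f in zip(init, frames)]


def distill(poses: List[float]) -> Predictor:
    # Build self-labeled synthetic pairs (render(pose), pose) and fit
    # pose ~ w * observation by least squares.
    xs = [render(p) for p in poses]
    sxx = sum(x * x for x in xs)
    w = sum(x * p for x, p in zip(xs, poses)) / sxx if sxx > 0 else 0.0
    return lambda obs: w * obs


def pod_cycle(
    frames: List[float], predictor: Predictor, num_cycles: int
) -> Tuple[Predictor, List[float]]:
    errors = []
    for _ in range(num_cycles):
        init = [predictor(f) for f in frames]   # predict
        poses = optimize(frames, init)          # optimize
        predictor = distill(poses)              # distill
        # Mean residual between re-rendered poses and the observed frames.
        errors.append(
            sum(abs(render(p) - f) for p, f in zip(poses, frames)) / len(frames)
        )
    return predictor, errors


# Toy usage: start from a predictor that always outputs 0.
final_predictor, residuals = pod_cycle([0.2, 0.5, 0.9], lambda obs: 0.0, num_cycles=2)
```

In this toy setup the residual shrinks from the first cycle to the second, because the distilled predictor gives the optimizer a better starting point. The real problem is far harder: the renderer is not invertible, and depth ambiguity and occlusion are exactly what make a good initialization valuable.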

Paper Structure

This paper contains 18 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: Predict-Optimize-Distill (POD) takes in a multi-view scan of an object and casually captured long-form human interaction video, and estimates 3D part poses over time. (A) Existing optimization-based methods experience failures under heavy occlusion or when incremental frame optimization drifts. (B) In contrast, POD utilizes a cycle consisting of a predictive feed-forward model, an optimization stage, and a self-distillation phase to iteratively improve the object part pose predictions. By training an object part pose prediction model, POD can predict correct part poses even under heavy occlusions in the observations (red circle).
  • Figure 2: POD Pipeline: POD builds a 3D object model from a multi-view scan of the object using 3D Gaussian Splatting (3DGS) and GARField. In the predict stage, POD estimates the object's part poses from a monocular video with a feed-forward model conditioned on RGB images. Using these predictions, it optimizes poses against the monocular video observations, utilizing quasi-multiview supervision by finding matching frames with similar predicted poses and jointly optimizing them. POD then distills the optimized poses back into the predictive model by generating synthetic data from novel views: it applies part poses to the object 3D model and renders RGB observations from different camera poses.
  • Figure 3: Example real objects used for qualitative evaluation. (Top) RGB renderings, (Bottom) part segmentations. Please see supplemental for more.
  • Figure 4: Synthetic objects used for quantitative evaluation. (Top) RGB renderings, (Bottom) part segmentations.
  • Figure 5: Quasi-multiview ablation. While POD without quasi-multiview generally avoids catastrophic failures, temporal consistency and photometric losses alone fail to correct inter-part pose errors due to depth ambiguity, causing poses to appear disconnected from novel viewpoints (red circles). The quasi-multiview signal from frame matching effectively resolves these errors.
  • ...and 4 more figures
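
The frame-matching step mentioned in the Figure 2 caption (finding frames with similar predicted poses so they can be optimized jointly) might be sketched as a nearest-neighbor search in predicted-pose space. This is a minimal sketch under stated assumptions, not the paper's procedure: poses are flattened to plain coordinate tuples, similarity is Euclidean distance, and the function name `mine_quasi_multiview` and the parameters `k` and `min_gap` are hypothetical.

```python
from math import dist
from typing import List, Sequence


def mine_quasi_multiview(
    pred_poses: Sequence[Sequence[float]], k: int = 2, min_gap: int = 1
) -> List[List[int]]:
    """For each frame, return indices of the k frames with the closest predicted pose.

    min_gap (an assumption, not from the paper) excludes frames closer than
    min_gap steps in time; min_gap=1 only excludes the frame itself.
    """
    matches = []
    for i, pose_i in enumerate(pred_poses):
        candidates = [
            (dist(pose_i, pose_j), j)
            for j, pose_j in enumerate(pred_poses)
            if abs(i - j) >= min_gap
        ]
        candidates.sort()  # closest predicted pose first
        matches.append([j for _, j in candidates[:k]])
    return matches


# Toy usage: frames 0 and 2 share a similar pose, as do frames 1 and 3.
toy_poses = [(0.0,), (1.0,), (0.05,), (1.1,)]
toy_matches = mine_quasi_multiview(toy_poses, k=1)  # [[2], [3], [0], [1]]
```

Each matched group could then share a single pose estimate during optimization, so frames that show the same configuration from different viewpoints act as an approximate multi-view constraint.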