NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis

Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhijit Kundu, Justin Johnson, David Fouhey, Leonidas Guibas

TL;DR

NIFTY tackles realistic 3D human-object interaction synthesis by coupling a neural object interaction field with an object-conditioned diffusion model and a scalable synthetic data pipeline. The method uses a SMPL-based pose diffusion conditioned on object geometry, guided by a learned field that encodes the interaction manifold, and is trained with large-scale synthetic data generated from a small set of anchor poses via reverse-time HuMoR rollouts. Key contributions include the object interaction field, the diffusion-guided sampling framework, and the automated data-generation pipeline, which collectively yield higher-quality, more plausible interactions (e.g., sitting and lifting) across diverse objects, with favorable quantitative metrics and user study results. This approach reduces data requirements for learning human-object interactions and enables flexible, object-aware motion synthesis in realistic scenes.

Abstract

We address the problem of generating realistic 3D motions of humans interacting with objects in a scene. Our key idea is to create a neural interaction field attached to a specific object, which outputs the distance to the valid interaction manifold given a human pose as input. This interaction field guides the sampling of an object-conditioned human motion diffusion model, so as to encourage plausible contacts and affordance semantics. To support interactions with scarcely available data, we propose an automated synthetic data pipeline. For this, we seed a pre-trained motion model, which has priors for the basics of human movement, with interaction-specific anchor poses extracted from limited motion capture data. Using our guided diffusion model trained on generated synthetic data, we synthesize realistic motions for sitting and lifting with several objects, outperforming alternative approaches in terms of motion quality and successful action completion. We call our framework NIFTY: Neural Interaction Fields for Trajectory sYnthesis.
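The guidance idea can be illustrated with a toy sketch. This is not the authors' implementation: the field here is a hand-written function whose "manifold" is a single target pose, the denoiser is a hypothetical stand-in that simply pulls frames toward a rest pose at the origin, and the names `interaction_field`, `toy_denoiser`, `sample`, and `guidance_scale` are all assumptions made for illustration. Following the Figure 2 caption, only the last pose of the predicted clean motion is corrected at each denoising step.

```python
import math
import random

def interaction_field(pose, target):
    """Toy stand-in for the learned field: returns the distance to a
    one-pose 'manifold' and the correction vector pointing toward it."""
    correction = [t - p for p, t in zip(pose, target)]
    distance = math.sqrt(sum(c * c for c in correction))
    return distance, correction

def toy_denoiser(motion):
    """Hypothetical stand-in for the diffusion model: predicts a 'clean'
    motion by pulling every frame toward a rest pose at the origin."""
    return [[0.9 * v for v in frame] for frame in motion]

def sample(target, guidance_scale, K=50, frames=8, dim=3, seed=0):
    """Run K denoising steps, nudging the last pose with the field each step."""
    rng = random.Random(seed)
    # Start from pure noise: a motion is a list of frames, a frame a list of floats.
    motion = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(frames)]
    for k in range(K, 0, -1):
        clean = toy_denoiser(motion)  # predicted clean motion at step k
        _, correction = interaction_field(clean[-1], target)
        # Guidance: move the final pose part of the way along the correction.
        clean[-1] = [p + guidance_scale * c for p, c in zip(clean[-1], correction)]
        sigma = (k - 1) / K  # noise level for step k-1; zero after the last step
        motion = [[v + sigma * rng.gauss(0.0, 1.0) for v in frame] for frame in clean]
    return motion

target = [3.0, 4.0, 0.0]  # illustrative 'valid interaction' pose
guided = sample(target, guidance_scale=0.5)
unguided = sample(target, guidance_scale=0.0)
guided_dist, _ = interaction_field(guided[-1], target)
unguided_dist, _ = interaction_field(unguided[-1], target)
```

With the same seed, the guided run ends with its final pose much closer to the target than the unguided run. In this toy, the correction is added directly to the clean prediction; that is one simple way to apply guidance and is an assumption here, not a detail taken from the paper.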

Paper Structure (28 sections, 4 equations, 13 figures, 8 tables, 1 algorithm)

Figures (13)

  • Figure 1: NIFTY Overview. (Left) Our learned object interaction field guides an object-conditioned diffusion model during sampling to generate plausible human-object interactions like sitting. (Right) Our automated training data synthesis pipeline generates data for this model by combining a scene-unaware motion model with small quantities of annotated interaction anchor pose data.
  • Figure 2: Model Architecture. Our full motion synthesis method (middle) consists of an object interaction field $F_\phi$ (left), which guides the diffusion model $M_\theta$ (right) at sampling time to produce plausible interaction motions. At each step $k \in [0, K=1000]$ of denoising, the diffusion model predicts a clean motion $\hat{\boldsymbol{\tau}}^{0}$ from a noisy motion input $\boldsymbol{\tau}^{k}$ and conditioning information. The object interaction field takes the last pose from the diffusion output as input, and uses guidance to push the pose towards the valid interaction manifold using a predicted pose correction.
  • Figure 3: Interaction Field Visualization. We query the field in several locations with a sitting pose (a subset shown in grey) and visualize the output for pelvis, feet, and neck joints. All cylinders are oriented towards the chair, indicating the correction vector's magnitude and direction. This correction is due to the misalignment between the sitting pose and chair position.
  • Figure 4: Generated Synthetic Data. We visualize motion sequences from one tree rollout for one sitting anchor pose. The middle shows a bird's-eye view of the pelvis joint trajectories in light pink. All trajectories end in the same sitting pose, but start at diverse locations around the chair. We highlight a few trajectories in blue and show full-body motions from the corresponding generations on the left and right sides. Our complete dataset contains many trees for different objects and humans.
  • Figure 5: User Study. NIFTY is preferred $\ge$ 88.7% of the time for sitting and $\ge$ 81.6% of the time for lifting compared to baselines. Our motions are also nearly indistinguishable from synthetic data trajectories.
  • ...and 8 more figures
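
The tree-structured data generation described in the Figure 4 caption can be sketched in a few lines. This is a toy under stated assumptions, not the paper's pipeline: the state is a 2D pelvis position rather than a full body pose, `backward_step` is a hypothetical random-walk stand-in for one reverse-time step of a pre-trained motion model, and the `depth`, `branching`, and `steps` parameters are invented for illustration. The point it shows is that rolling out backward from one anchor and then reversing each path yields trajectories that start in different places but all end at the same anchor.

```python
import random

def backward_step(state, rng):
    """Hypothetical stand-in for one reverse-time step of a motion model."""
    x, y = state
    return (x + rng.gauss(0.0, 0.1), y + rng.gauss(0.0, 0.1))

def rollout_tree(anchor, depth, branching, steps, rng):
    """Grow a tree backward in time from `anchor` and return
    forward-time trajectories that all end at `anchor`."""
    paths = [[anchor]]  # each path is stored backward in time, anchor first
    for _ in range(depth):
        grown = []
        for path in paths:
            for _ in range(branching):
                branch = list(path)  # copy so siblings share only their history
                for _ in range(steps):
                    branch.append(backward_step(branch[-1], rng))
                grown.append(branch)
        paths = grown
    # Reverse every path so time runs forward and the anchor comes last.
    return [path[::-1] for path in paths]

rng = random.Random(0)
anchor = (0.0, 0.0)  # illustrative anchor position at the object
trajectories = rollout_tree(anchor, depth=3, branching=2, steps=5, rng=rng)
```

Each leaf of the tree gives one training trajectory, so the number of sequences grows as `branching ** depth` from a single anchor pose. Sibling trajectories share the portion of the motion closest to the anchor and diverge further back in time.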