CHOICE: Coordinated Human-Object Interaction in Cluttered Environments for Pick-and-Place Actions

Jintao Lu, He Zhang, Yuting Ye, Takaaki Shiratori, Sebastian Starke, Taku Komura

TL;DR

This work tackles the problem of synthesizing realistic, long-horizon human-object interactions in cluttered environments by introducing CHOICE, a hierarchical system that combines a neural implicit trajectory planner, a bimanual task scheduler, and a DeepPhase-based controller. The trajectory planner uses a three-field implicit representation $\big(D_t, D_o, D_{toa}\big)$ with a time-of-arrival field learned from motion capture and an auto-decoder to generalize to unseen scenes, producing collision-free wrist trajectories. The DeepPhase controller employs a linear dynamical formulation in the phase latent space and a Kalman filter to robustly track goal-phase states, enabling smooth full-body coordination across hands and the hip. A dedicated navigation and scheduling module choreographs 2D path planning, motion matching, and bimanual task sequencing, while a large MoCap-based CHOICE dataset supports training. Empirical results show stronger motion realism, higher safety distances, and about a 96% success rate on unseen layouts, demonstrating substantial generalization to novel cluttered scenes and complex containers. The framework advances realistic, full-body interaction synthesis with meaningful implications for animation and robotics in real-world environments.
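To make the time-of-arrival idea concrete, here is a minimal sketch. It is not the learned neural field described above: it assumes a small hand-written 2D occupancy grid, computes a discrete time-of-arrival field by breadth-first search from the goal (obstacle cells keep an infinite arrival time), and traces a collision-free path by repeatedly stepping to the neighbor with the lowest arrival time. The function names `time_of_arrival` and `trace_path` and the grid itself are hypothetical, chosen only for illustration.

```python
from collections import deque
import math

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def time_of_arrival(grid, goal):
    """Discrete arrival time (in grid steps) from every free cell to `goal`.

    grid[r][c] == 1 marks an obstacle. Obstacles and unreachable cells
    keep math.inf, playing the role of the "infinite distance" region.
    """
    rows, cols = len(grid), len(grid[0])
    toa = [[math.inf] * cols for _ in range(rows)]
    toa[goal[0]][goal[1]] = 0
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in NEIGHBORS:
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and toa[nr][nc] == math.inf:
                toa[nr][nc] = toa[r][c] + 1
                queue.append((nr, nc))
    return toa

def trace_path(toa, start):
    """Follow strictly decreasing arrival time from `start` to the goal."""
    if toa[start[0]][start[1]] == math.inf:
        return []  # start is blocked or cannot reach the goal
    rows, cols = len(toa), len(toa[0])
    path = [start]
    r, c = start
    while toa[r][c] > 0:
        candidates = [
            (r + dr, c + dc)
            for dr, dc in NEIGHBORS
            if 0 <= r + dr < rows and 0 <= c + dc < cols
        ]
        r, c = min(candidates, key=lambda cell: toa[cell[0]][cell[1]])
        path.append((r, c))
    return path

# Hypothetical scene: a two-cell obstacle between start and goal.
grid = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
toa = time_of_arrival(grid, goal=(0, 3))
path = trace_path(toa, start=(2, 0))
```

Because breadth-first search guarantees that every reachable cell has a neighbor exactly one step closer to the goal, the traced path never enters an obstacle cell. A learned continuous field would replace the grid lookup with a network query and the neighbor search with a gradient step.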

Abstract

Animating human-scene interactions such as pick-and-place tasks in cluttered, complex layouts is challenging: objects vary widely in geometry and articulation, and scenes contain various obstacles. The main difficulty lies in the sparsity of the motion data compared to the wide variation of objects and environments, as well as the poor availability of transition motions between different tasks, both of which make generalization to arbitrary conditions harder. To cope with this issue, we develop a system that tackles the interaction synthesis problem as a hierarchical goal-driven task. First, we develop a bimanual scheduler that plans a set of keyframes for simultaneously controlling the two hands to efficiently achieve the pick-and-place task from an abstract goal signal, such as the target object selected by the user. Next, we develop a neural implicit planner that generates guidance hand trajectories under diverse object shapes and types and obstacle layouts. Finally, we propose a linear dynamic model for our DeepPhase controller that incorporates a Kalman filter to enable smooth transitions in the frequency domain, resulting in more realistic and effective multi-objective control of the character. Our system can produce a wide range of natural pick-and-place movements with respect to the geometry of objects, the articulation of containers, and the layout of the objects in the scene.
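As a rough illustration of how a Kalman filter pairs with a linear dynamic model, the sketch below runs a generic scalar filter that tracks a fixed goal value from noisy observations. It is an assumption-laden toy, not the controller from the paper: the real system filters phase features in a latent space, whereas here the state is a single number, and the name `kalman_step` and the parameters `a`, `b`, `q`, `r` are hypothetical.

```python
def kalman_step(x, p, u, z, a=1.0, b=1.0, q=1e-3, r=1e-1):
    """One predict/update cycle of a scalar Kalman filter.

    Assumed model (illustrative only):
        state:       x_next = a * x + b * u + process noise (variance q)
        observation: z      = x + measurement noise (variance r)
    Returns the corrected state estimate and its variance.
    """
    # Predict: push the estimate and its uncertainty through the linear model.
    x_pred = a * x + b * u
    p_pred = a * p * a + q
    # Update: blend prediction and observation using the Kalman gain.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new

# Track a constant goal from observations that alternate above and below it.
goal = 1.0
x, p = 0.0, 1.0  # deliberately poor initial estimate with high uncertainty
for k in range(50):
    z = goal + (0.1 if k % 2 == 0 else -0.1)
    x, p = kalman_step(x, p, u=0.0, z=z)
```

The gain starts near 1 while the estimate is uncertain, so the filter jumps toward the first observation, then shrinks as the variance drops, which smooths out the alternating measurement error instead of following it.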

Paper Structure

This paper contains 45 sections, 24 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overview of our CHOICE system: Given the objects around the clicked goal object (pic. 1) and an action instruction from the keyboard, our system assigns tasks and motions to the empty hands and matches bimanual goals based on the character state and the goal tasks (green block that outputs pic. 2). The matched hand goal priors are then re-planned by our trajectory planning sub-system (in dashed brown) to generate a trajectory of manipulation goals that fits the runtime environment (pic. 4). After planning a set of navigation goals (pic. 3) alongside the manipulation trajectory, our goal coordination arranges the goals for the joints (pic. 5) and sets a goal phase prior for the current character. The second major stage of our system is the goal-driven motion controller. It auto-regressively updates the character's motion towards the key-joint goals, forming a loop with a Kalman filter, which corrects and sets the goal phase features.
  • Figure 2: Implicit neural trajectory planner: The scene visualization of the three fields uses green to show field values, where deeper green represents a lower distance, and blue highlights the region of infinite distance. The right-side images give 2D slices at the teapot height, where the zero-level set is shown as orange curves. In a test scene, $\mathbf{z}$ is optimized to reconstruct the known part of the output, which is enclosed by the blue rectangles. The dashed blue rectangle marks the pre-known part of the time-of-arrival field.
  • Figure 3: Framework of our DeepPhase interaction controller: The Kalman filter estimates the target phase correlated with the goal key-joint transformations. The Gating Network compares the features of the key-joint transformations and the phase features from the current and goal frames. After each motion prediction for the next frame, our bi-directional control blends the key-joint transformation predictions and feeds the covariance back to the Kalman filter based on the displacement of the bi-directional prediction.
  • Figure 4: The distribution of amplitude and frequency control in $\mathbf{U}$. Natural motion transitions, which exhibit consistent acceleration and deceleration patterns and reflect the motion diversity during interactions, follow a zero-mean Gaussian distribution in the frequency-domain latent.
  • Figure 5: Overview of our state-machine structure for synthesizing coordinated interaction-guiding goals (corresponds to the goal coordination block before the fused goal in Fig. \ref{fig:pipeline}). It adaptively allocates coordinated goals for both the hands and the entire body according to the test environment, as described in § \ref{sec:bimanual}. The navigation goal before arrival (see § \ref{sec:navigation}) and the interaction goals during manipulation (plotted as green capsules) are generated sequentially to guide the DeepPhase interaction controller.
  • ...and 7 more figures
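
The caption of Figure 2 describes optimizing the latent code $\mathbf{z}$ at test time so that the decoder output matches the known part of the fields. The sketch below shows that auto-decoder pattern in its simplest form, under strong assumptions: the decoder is a fixed linear map rather than a neural network, the output has four entries, and only two of them are observed. The names `decode` and `fit_latent` and all numeric values are hypothetical.

```python
def decode(weights, z):
    """Frozen linear 'decoder': output[i] = sum_j weights[i][j] * z[j]."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in weights]

def fit_latent(weights, observed, steps=500, lr=0.1):
    """Gradient descent on z only; the decoder weights are never changed.

    `observed` maps output index -> known value. The squared-error loss
    is summed over observed entries only (a masked reconstruction loss).
    """
    z = [0.0] * len(weights[0])
    for _ in range(steps):
        out = decode(weights, z)
        grad = [0.0] * len(z)
        for i, target in observed.items():
            err = out[i] - target
            for j, w in enumerate(weights[i]):
                grad[j] += 2.0 * err * w  # d/dz_j of (out[i] - target)^2
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z

# Hypothetical decoder and a partially observed output vector.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
observed = {0: 0.5, 2: 0.25}          # entries 1 and 3 are unknown
z_fit = fit_latent(weights, observed)
completed = decode(weights, z_fit)    # fills in the unobserved entries
```

Once the latent code explains the observed entries, decoding it also produces values for the unobserved ones, which is the sense in which the known region of a field constrains the rest of it.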