Placing Human Animations into 3D Scenes by Learning Interaction- and Geometry-Driven Keyframes

James F. Mullen; Divya Kothandaraman; Aniket Bera; Dinesh Manocha

TL;DR

The paper tackles the problem of placing 3D human animations into static 3D scenes while preserving interactions by introducing PAAK, a keyframe-driven framework. It combines Geometric Keyframes and Active Keyframes, using an energy function $E(\tau, \theta)$ that balances scene affordances against penetration losses to guide placement, with per-frame interaction cues from POSA and a BADGE-based diversity mechanism to select informative frames. Evaluations on the PROX dataset with perceptual user studies show PAAK yields more realistic placements than PROX ground truth and competing baselines, demonstrating the value of keyframe-driven optimization over end-to-end or purely geometric approaches. Limitations remain, including occasional unnatural placements, and future work is proposed for multi-person scenarios, end-user quality ratings, and allowing animation-level adjustments to further enhance realism.
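To make the idea of a keyframe-weighted placement energy concrete, here is a minimal sketch in Python. It is not the paper's formulation of $E(\tau, \theta)$: the 2D setting, the weighted-sum form, the trade-off factor `lam`, the toy seat and wall, and every function name below are assumptions chosen for illustration. It only shows how a translation $\tau$ and a rotation $\theta$ could be scored by combining an affordance term with a penetration term over weighted keyframes.

```python
import math

def transform(point, tau, theta):
    """Rotate a 2D point by theta about the origin, then translate by tau."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + tau[0], s * x + c * y + tau[1])

def energy(tau, theta, keyframes, affordance_loss, penetration_loss, lam=1.0):
    """Weighted sum over keyframes of affordance mismatch plus penetration.

    keyframes is a list of (weight, points) pairs. Both loss callables and
    `lam` are hypothetical stand-ins, not the paper's actual terms.
    """
    total = 0.0
    for weight, points in keyframes:
        placed = [transform(p, tau, theta) for p in points]
        total += weight * (affordance_loss(placed) + lam * penetration_loss(placed))
    return total

def best_placement(candidates, keyframes, affordance_loss, penetration_loss):
    """Exhaustively pick the (tau, theta) candidate with the lowest energy."""
    return min(
        candidates,
        key=lambda c: energy(c[0], c[1], keyframes, affordance_loss, penetration_loss),
    )

# Toy scene (assumed): one seat the contact points should reach, one wall.
SEAT = (2.0, 0.0)
WALL_X = 3.0

def toy_affordance_loss(points):
    """Sum of distances from the placed contact points to the seat."""
    return sum(math.dist(p, SEAT) for p in points)

def toy_penetration_loss(points):
    """Number of placed points that lie beyond the wall."""
    return float(sum(1 for p in points if p[0] > WALL_X))
```

With a single keyframe whose contact point sits at the origin, the candidate that moves that point onto the seat scores lowest, while a candidate that pushes it past the wall pays both the distance and the penetration penalty. A real system would replace the exhaustive search with a continuous optimizer over 3D placements.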

Abstract

We present a novel method for placing a 3D human animation into a 3D scene while maintaining any human-scene interactions in the animation. We identify the meshes in the animation that are most important for the interaction with the scene, which we call "keyframes." These keyframes allow us to better optimize the placement of the animation into the scene such that interactions in the animation (standing, lying, sitting, etc.) match the affordances of the scene (e.g., standing on the floor or lying in a bed). We compare our method, which we call PAAK, with prior approaches, including POSA, PROX ground truth, and a motion synthesis method, and highlight the benefits of our method with a perceptual study. Human raters preferred our PAAK method over the PROX ground truth data 64.6\% of the time. Additionally, in direct comparisons, the raters preferred PAAK over competing methods, including 61.5\% of the time over POSA.

Paper Structure (12 sections, 7 equations, 6 figures, 4 tables)

This paper contains 12 sections, 7 equations, 6 figures, 4 tables.

Introduction
Main Contributions
Related Work
Placement of Human Animations into 3D Scenes
Human-Scene Interaction Estimation
Geometric Keyframes
Active Keyframes
Scene Placement with Keyframes
Experiments
Dataset and Baselines
Evaluation
Conclusions, Limitations, and Future Work

Figures (6)

Figure 1: Our goal is to place an animation, a 3D sequence of human motion, into a 3D scene while maintaining any interactions with the scene that the animation contains. First, we select "keyframes," the most important meshes in the animation for modeling interactions with the scene. In the animation, the leftmost mesh, where the human is sitting, would be a keyframe. We then use the keyframes to find a placement in the scene that best matches the interactions in the animation (green circles, right).
Figure 2: An overview of our PAAK method. We first estimate human-scene interactions and use those interactions to determine the keyframes in the animation. We can then utilize the keyframes alongside the 3D scene itself to place the animation convincingly into the scene.
Figure 3: Network Architecture. The input animation is n meshes with v vertices and f features each. We utilize four fully connected (FC) layers: the first layer operates across each vertex, the second operates across all the vertices in a mesh, and the last two operate across the entire animation. The model outputs an array of size n, where each entry is the weight of the corresponding mesh in the animation. The m values are intermediate representations, and the FL layers correspond to a flattening of the input along the last two dimensions.
Figure 4: Random samples from our active keyframe framework. The green frames are those with the highest weight in $K_a$. Note that the frames where an important interaction occurs are preferred.
Figure 5: Comparisons of placing the same animation into the same scene with the POSA-T, Geometric Keyframes, and Active Keyframes methods. Two angles of each placement are provided. For Placement 1, only the Active Keyframes method placed the animation on the bed (green circle), allowing for a more reclined seating position that results in standing more upright at the end of the animation. For Placement 2, a jumping action is taking place. Only the Active Keyframes method was able to position the animation such that the hands were not in collision with the back wall.
...and 1 more figure
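The tensor shapes described in the Figure 3 caption can be traced with a short NumPy sketch. Only the layer ordering, the flattening steps, and the input and output sizes come from the caption. The ReLU activations, the softmax on the output, the hidden sizes `m1`, `m2`, `m3`, the random initialization, and the function names are assumptions made so that the example runs.

```python
import numpy as np

def init_params(n, v, f, m1=4, m2=8, m3=16, seed=0):
    """Random weights for the four FC layers (hidden sizes are assumed)."""
    rng = np.random.default_rng(seed)
    shapes = {"W1": (f, m1), "W2": (v * m1, m2), "W3": (n * m2, m3), "W4": (m3, n)}
    params = {name: rng.normal(0.0, 0.1, shape) for name, shape in shapes.items()}
    params.update({"b1": np.zeros(m1), "b2": np.zeros(m2),
                   "b3": np.zeros(m3), "b4": np.zeros(n)})
    return params

def keyframe_weights(anim, params):
    """Map an animation of shape (n, v, f) to one weight per mesh, shape (n,)."""
    relu = lambda x: np.maximum(x, 0.0)  # assumed activation
    n = anim.shape[0]
    h = relu(anim @ params["W1"] + params["b1"])  # FC 1, per vertex: (n, v, m1)
    h = h.reshape(n, -1)                          # FL: flatten to (n, v * m1)
    h = relu(h @ params["W2"] + params["b2"])     # FC 2, per mesh: (n, m2)
    h = h.reshape(-1)                             # FL: flatten to (n * m2,)
    h = relu(h @ params["W3"] + params["b3"])     # FC 3, whole animation: (m3,)
    logits = h @ params["W4"] + params["b4"]      # FC 4, whole animation: (n,)
    exp = np.exp(logits - logits.max())           # assumed softmax normalization
    return exp / exp.sum()
```

Because the last two layers in this sketch see the flattened animation, their weight shapes depend on n, so an input would need to be resampled or padded to a fixed number of meshes before being passed in.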
