SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation

Wenjia Wang, Liang Pan, Zhiyang Dou, Jidong Mei, Zhouyingcheng Liao, Yuke Lou, Yifan Wu, Lei Yang, Jingbo Wang, Taku Komura

TL;DR

SIMS addresses the challenge of generating long-horizon, stylized, physically plausible human-scene interactions by coupling Retrieval-Augmented Script Generation (RASG) with a multi-condition physics-based controller. An LLM-planner builds executable scripts from a short-script database, retrieved and extended via CLIP-guided similarity, while a scene- and text-aware policy realizes motions in a physics simulator under a finite-state machine schedule. The approach is validated on diverse datasets (SAMP, COUCH, AMASS, 3DFront, ViconStyle) and metrics including FID, APD, Success Rate, and Contact Error, with user studies showing superior realism and expressiveness compared with state-of-the-art baselines. The results demonstrate improved skill coverage, diversity, and physical coherence, and the work provides scalable paths for adding new skills and styles through script databases and policy training. Overall, SIMS offers a practical, extensible framework for controllable, long-term stylized HSI with potential applications in animation, robotics, and embodied AI.

Abstract

Simulating stylized human-scene interactions (HSI) in physical environments is a challenging yet fascinating task. Prior works emphasize long-term execution but fall short in achieving both diverse style and physical plausibility. To tackle this challenge, we introduce a novel hierarchical framework named SIMS that seamlessly bridges high-level script-driven intent with a low-level control policy, enabling more expressive and diverse human-scene interactions. Specifically, we employ Large Language Models with Retrieval-Augmented Generation (RAG) to generate coherent and diverse long-form scripts, providing a rich foundation for motion planning. A versatile multi-condition physics-based control policy is also developed, which leverages text embeddings from the generated scripts to encode stylistic cues, simultaneously perceiving environmental geometries and accomplishing task goals. By integrating the retrieval-augmented script generation with the multi-condition controller, our approach provides a unified solution for generating stylized HSI motions. We further introduce a comprehensive planning dataset produced by RAG and a stylized motion dataset featuring diverse locomotions and interactions. Extensive experiments demonstrate SIMS's effectiveness in executing various tasks and generalizing across different scenarios, significantly outperforming previous methods.

Paper Structure

This paper contains 38 sections, 3 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: SIMS enables physically simulated characters to perform diverse skills within complex 3D scenes given long-term daily narratives and scene inputs. Our character can perform versatile skills, including Locomotion, Human-Scene Interactions, and Dynamic Object Interactions, in diverse styles while achieving physically plausible contacts and obstacle avoidance. Left: a dialogue-based retrieval-augmented script generation process. Right: a skillful humanoid performing diverse stylized interactions in a 3D scene.
  • Figure 2: (a) Our main pipeline. We prompt LLMs to generate new short scripts following their emotion and interaction logic. The retrieval process includes two stages: we first retrieve the top-k short scripts by semantic similarity, then ask the LLM to select useful samples from them and concatenate these into a fluent long-term story. (b) The Finite State Machine. We parse skills, captions, and scene geometry from each keyframe into task goals, language embeddings, and heightmap conditions to drive the low-level physical control policy. (c) The multi-condition physics policy. We divide common skills into 3 categories: Locomotion, HSI, and DOI. Skills in the same category share similar task observations and reward computations.
  • Figure 3: Long-term scripts with detailed keyframes and vivid final stories in two complex 3D scenes generated by our complete system. Upper: character in the bedroom and living room. Lower: character in the living room, dining room, and study room. We briefly show the retrieved summaries, keyframes, and part of the final long stories.
  • Figure 4: Qualitative results for skills with different text conditions.
  • Figure 5: ViconStyle demos.
  • ...and 3 more figures
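
The first retrieval stage described in the Figure 2 caption (top-k short scripts ranked by semantic similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `retrieve_top_k`, the example scripts, and the hand-made 3-D vectors are all hypothetical. A real system would obtain the embeddings from a text encoder (the TL;DR mentions CLIP-guided similarity), and the second stage, in which the LLM selects and concatenates the retrieved samples, is not shown.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_emb: Sequence[float],
    script_embs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k most similar scripts, best first."""
    scored = [
        (i, cosine_similarity(query_emb, emb)) for i, emb in enumerate(script_embs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical short-script database with toy embeddings (assumption:
# real embeddings would come from a text encoder, not be written by hand).
short_scripts = [
    "walks happily to the sofa and sits down",
    "sneaks across the room",
    "sits on the chair, tired",
]
script_embs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 0.6],
]
query_emb = [1.0, 0.0, 0.0]  # toy embedding of the user's request

top = retrieve_top_k(query_emb, script_embs, k=2)
top_texts = [short_scripts[i] for i, _ in top]
```

In this toy setup, `top_texts` holds the two candidate short scripts that would be handed to the LLM for the second stage. Sorting all scores is adequate for a small database; a larger one would typically use an approximate nearest-neighbor index instead.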