
CrowdMoGen: Zero-Shot Text-Driven Collective Motion Generation

Yukang Cao, Xinying Guo, Mingyuan Zhang, Haozhe Xie, Chenyang Gu, Ziwei Liu

TL;DR

CrowdMoGen tackles the challenge of generating realistic, event-driven crowd motions from text by decoupling scene planning from motion synthesis. A crowd scene planner uses GPT-4 to group individuals and assign scene-guided and event-driven activities, while an SMPL-based joint prior provides low-level motion cues. The collective motion generator then injects 3D spatial controls into a transformer-based diffusion model, employing InputMixing and a joint-specific ControlAttention architecture, along with targeted losses to enforce realism and spatial fidelity. Extensive experiments on HumanML3D and KIT-ML demonstrate improved spatial coherence, event alignment, and motion quality over prior methods, highlighting strong potential for urban simulation and large-scale interactive environments.

Abstract

While recent advances in text-to-motion generation have shown promising results, they typically assume all individuals are grouped as a single unit. Scaling these methods to handle larger crowds and ensuring that individuals respond appropriately to specific events remains a significant challenge. This is primarily due to the complexities of two tasks: scene planning, which involves organizing groups, planning their activities, and coordinating interactions; and controllable motion generation. In this paper, we present CrowdMoGen, the first zero-shot framework for collective motion generation, which effectively groups individuals and generates event-aligned motion sequences from text prompts. 1) Because the available datasets are too limited to train an effective scene planning module in a supervised manner, we instead propose a crowd scene planner that leverages pre-trained large language models (LLMs) to organize individuals into distinct groups. While LLMs offer high-level guidance for group divisions, they lack a low-level understanding of human motion. To address this, we further propose integrating an SMPL-based joint prior to generate context-appropriate activities, which consist of both joint trajectories and textual descriptions. 2) To incorporate the assigned activities into the generative network, we introduce a collective motion generator that integrates the activities into a transformer-based network in a joint-wise manner, maintaining the spatial constraints during the multi-step denoising process. Extensive experiments demonstrate that CrowdMoGen significantly outperforms previous approaches, delivering realistic, event-driven motion sequences that are spatially coherent. As the first framework for collective motion generation, CrowdMoGen has the potential to advance applications in urban simulation, crowd planning, and other large-scale interactive environments.

Paper Structure (16 sections, 9 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: CrowdMoGen is a zero-shot, text-driven framework that enables generalizable planning and generation of crowd motions. Given a scene context, we aim to generate realistic crowd motions that fit the scene settings.
  • Figure 2: Overview of CrowdMoGen. The CrowdMoGen framework comprises two main components: 1) Crowd Scene Planner, which uses a Large Language Model (LLM) to interpret and arrange crowd motions based on textual requirements from the user. This component then provides unified control signals in both textual and spatial formats. 2) Collective Motion Generator, which leverages these control signals to manipulate and generate realistic individual motions.
  • Figure 3: Scene-guided activities and event-driven activities. The Crowd Scene Planner is able to deal with crowd motions effectively at both the scene and motion levels. It manages both scene-guided and event-driven activities, ensuring realistic and coherent crowd scenarios. Best viewed in PDF with zoom-in.
  • Figure 4: Qualitative Visualizations. Displayed are selected frames from the crowd motion sequences generated by our proposed CrowdMoGen. It effectively creates scenarios involving multi-person close interactions (a)-(b), dynamic crowd movements (c)-(d), and complex crowd scenes (e)-(f) that accurately and naturally reflect the specified scene descriptions.
  • Figure 5: User Study: Comparative Analysis of Planning Methods. This chart presents the comparison results between our Crowd Scene Planner and plain GPT-4, based on participant preferences for text-motion consistency (TM. Con.) and motion quality (M. Qual.). The percentages reflect the proportion of participants who favored each method.
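
The "unified control signals in both textual and spatial formats" described in the Figure 2 caption can be pictured with a small data-structure sketch. This is not the paper's implementation: the names `ControlSignal`, `linear_path`, `build_controls`, and `control_error` are hypothetical, the group assignments that the paper obtains from an LLM are passed in by hand here, straight-line pelvis paths stand in for the SMPL-based joint prior, and the diffusion-based generator is omitted entirely. The sketch only shows how a per-person text prompt and sparse per-joint 3D targets might be bundled, and how adherence to those targets could be measured.

```python
import math
from dataclasses import dataclass, field

# Hypothetical types: names and fields are illustrative, not from the paper's code.
Point3D = tuple[float, float, float]


@dataclass
class ControlSignal:
    """Control for one individual: a text prompt plus sparse joint targets."""
    text: str
    # joint name -> {frame index -> target position}; missing frames are unconstrained
    joint_targets: dict[str, dict[int, Point3D]] = field(default_factory=dict)


def linear_path(start: Point3D, goal: Point3D, num_frames: int) -> dict[int, Point3D]:
    """Straight-line targets from start to goal (a stand-in for a learned prior)."""
    if num_frames < 2:
        return {0: goal}
    return {
        t: tuple(s + (g - s) * t / (num_frames - 1) for s, g in zip(start, goal))
        for t in range(num_frames)
    }


def build_controls(groups: dict, num_frames: int) -> dict[str, ControlSignal]:
    """Turn group-level plans into per-person control signals.

    groups maps a group name to (activity text, start, goal, member ids).
    In the paper the grouping comes from an LLM; here it is given by hand.
    """
    controls = {}
    for activity, start, goal, members in groups.values():
        for person in members:
            controls[person] = ControlSignal(
                text=activity,
                joint_targets={"pelvis": linear_path(start, goal, num_frames)},
            )
    return controls


def control_error(signal: ControlSignal, motion: dict[str, list[Point3D]]) -> float:
    """Mean Euclidean distance to the targets, over constrained frames only."""
    dists = [
        math.dist(motion[joint][t], target)
        for joint, frames in signal.joint_targets.items()
        for t, target in frames.items()
    ]
    return sum(dists) / len(dists) if dists else 0.0
```

A generator that respects the controls should drive `control_error` toward zero for every individual; the sketch makes no claim about which metric the paper reports.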