EnvSocial-Diff: A Diffusion-Based Crowd Simulation Model with Environmental Conditioning and Individual-Group Interaction

Bingxue Zhao, Qi Zhang, Hui Huang

Abstract

Modeling realistic pedestrian trajectories requires accounting for both social interactions and environmental context, yet most existing approaches emphasize social dynamics. We propose EnvSocial-Diff, a diffusion-based crowd simulation model informed by social physics and augmented with environmental conditioning and individual-group interaction. Our structured environmental conditioning module explicitly encodes obstacles, objects of interest, and lighting levels, providing interpretable signals that capture scene constraints and attractors. In parallel, the individual-group interaction module goes beyond individual-level modeling by capturing both fine-grained interpersonal relations and group-level conformity through a graph-based design. Experiments on multiple benchmark datasets demonstrate that EnvSocial-Diff outperforms the latest state-of-the-art methods, underscoring the importance of explicit environmental conditioning and multi-level social interaction for realistic crowd simulation. Code is available at https://github.com/zqyq/EnvSocial-Diff.

Paper Structure (16 sections, 24 equations, 7 figures, 16 tables)

This paper contains 16 sections, 24 equations, 7 figures, 16 tables.

Figures (7)

  • Figure 1: Environmental factors are important in crowd simulation. The target pedestrian is influenced by nearby neighbors, obstacles, objects of interest (OOI), and lighting conditions. The scene image is divided into grids to calculate lighting information.
  • Figure 2: EnvSocial-Diff pipeline. Pedestrian motion is modeled as in the Social Force Model (SFM), where the destination force $\vec{F}_i^{\,\text{dest}}$ is applied outside the diffusion process to preserve long-term intent. The conditioning signals $c_i^t = [\vec{F}_i^{\,\text{env}} \oplus \vec{F}_i^{\,\text{social}} \oplus \vec{F}_i^{\,\text{hist}}]$ aggregate three interactive components: (1) Environmental Conditioning — obstacle and OOI features are encoded via cross-attention with pedestrians, while lighting features are extracted from grid-based scene brightness; (2) Individual–Group Interactions — GNNs encode individual-level (sim$^1_{ij}$, sim$^2_{ij}$), group-level (sim$^3_i$), and relative motion ($r_{ij}$) to produce the social force $\vec{F}_i^{\,\text{social}}$; and (3) Historical Trajectories — short-term motion trends are encoded from recent states using an LSTM. Given $c_i^t$ and Gaussian noise $\boldsymbol{\epsilon}\!\sim\!\mathcal{N}(0,1)$, the denoiser $f_\theta$ performs reverse diffusion to recover clean accelerations $\hat{\mathbf{y}}^{\,t}_{i,0}$, which are then combined with the destination force to yield the final prediction $\hat{\vec{a}}^{\,t}_i$.
  • Figure 3: Comparison with baselines on UCY. (A) Predicted trajectories: our method (cyan) follows the ground truth (blue) more closely than SFM (magenta) and SPDiff (orange). (B) Error curves over time: our method consistently achieves lower MAE and OT, especially at longer horizons.
  • Figure 4: Comparison between the original GC subregion and the full GC scene. The left image highlights the cropped subarea (blue box) used in prior work, which limits spatial and interaction diversity. The right image shows the complete GC scene, covering a broader area with higher pedestrian density and environmental complexity, used in our extended evaluation.
  • Figure 5: Qualitative comparison between the GT, Ours, and SPDiff. (A) Near obstacles, SPDiff trajectories may pass much closer to obstacles, whereas both GT and our method keep a more reasonable distance. (B) In crowded regions, SPDiff produces several near-collision interactions, while our predictions remain smoother and more socially consistent. (C)
  • ...and 2 more figures
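
The Figure 1 caption notes that the scene image is divided into grids to calculate lighting information. A minimal sketch of one way such a grid could be computed is given below. It assumes a grayscale image stored as a list of rows with values in [0, 1], and it assumes that each cell's lighting level is the mean brightness of its pixels. The function names `grid_brightness` and `lighting_at` are hypothetical and are not taken from the EnvSocial-Diff code; the paper's actual lighting features may be computed differently.

```python
def grid_brightness(gray, rows, cols):
    """Mean brightness of each cell in a rows x cols grid over a grayscale image.

    `gray` is a list of H rows, each a list of W values in [0, 1].
    Assumes rows <= H and cols <= W so that no cell is empty.
    The per-cell mean is an illustrative choice, not the paper's stated method.
    """
    h, w = len(gray), len(gray[0])
    grid = []
    for r in range(rows):
        # Integer cell boundaries; cells differ by at most one pixel in size.
        r0, r1 = r * h // rows, (r + 1) * h // rows
        row = []
        for c in range(cols):
            c0, c1 = c * w // cols, (c + 1) * w // cols
            cell = [gray[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(cell) / len(cell))
        grid.append(row)
    return grid


def lighting_at(position_xy, grid, image_size):
    """Look up the brightness of the cell containing pixel position (x, y).

    `image_size` is (H, W). Positions on the far edge are clamped to the last cell;
    positions are assumed to be non-negative.
    """
    rows, cols = len(grid), len(grid[0])
    h, w = image_size
    x, y = position_xy
    r = min(int(y * rows / h), rows - 1)
    c = min(int(x * cols / w), cols - 1)
    return grid[r][c]


# Toy 4x4 image: dark left half, bright right half.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
grid = grid_brightness(image, rows=2, cols=2)
level = lighting_at((3, 0), grid, image_size=(4, 4))
```

In this toy case the pedestrian at pixel (3, 0) falls in the bright right-hand column of cells. A per-pedestrian value of this kind is one plausible input to a lighting feature, though how EnvSocial-Diff turns grid brightness into a conditioning signal is not specified in the captions above.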