Enactor: From Traffic Simulators to Surrogate World Models

Yash Ranjan, Rahul Sengupta, Anand Rangarajan, Sanjay Ranka

Abstract

Traffic microsimulators are widely used to evaluate road network performance under various "what-if" conditions. However, the behavior models that control the actors are overly simplistic and fail to capture realistic actor-actor interactions. Deep learning-based methods have been applied to model vehicles and pedestrians as "agents" responding to their surrounding "environment" (including lanes, signals, and neighboring agents). Although effective at learning actor-actor interactions, these approaches fail to generate physically consistent trajectories over long time horizons, and they do not explicitly address the complex dynamics that arise at traffic intersections, which are critical locations in urban networks. Inspired by the World Model paradigm, we develop an actor-centric generative model with a transformer-based architecture that captures actor-actor interactions while also accounting for the geometry of the traffic intersection, generating physically grounded trajectories based on learned behavior. Moreover, we test the model in a live "simulation-in-the-loop" setting, where we generate the initial conditions of the actors using SUMO and then let the model control their dynamics. We run the simulation for 40000 timesteps (4000 seconds), testing the model over a long time horizon and evaluating the trajectories on traffic-engineering metrics. Experimental results demonstrate that the proposed framework effectively captures complex actor-actor interactions and generates long-horizon, physically consistent trajectories, while requiring significantly fewer training samples than traditional agent-centric generative approaches. Our model outperforms the baseline on both traffic-related and aggregate metrics, improving on the baseline by more than 10x on KL-Divergence.

Paper Structure (15 sections, 15 equations, 2 figures, 6 tables)


Figures (2)

  • Figure 1: Transformer-based architecture for learning multi-actor interactions and generating physically consistent trajectories. (a) The SUMO simulator initializes the traffic scene based on a set of control parameters. For each vehicle, neighboring actors are identified and their states are vectorized. (b) A state encoder maps the actor and its neighbors into embeddings, which are processed by a spatial attention module to produce interaction-aware spatial embeddings over the previous $H$ timesteps. (c) The sequence of spatial embeddings for each actor is combined with a learned query embedding and passed through a temporal attention module to obtain a temporally informed representation, which is used to parameterize a Gaussian distribution over the actor’s next state. During inference, this distribution is sampled to generate the next state. (d) The newly generated actor state, together with updated neighbor states, is re-encoded and fed back into the model, enabling recursive trajectory unrolling in closed loop.
  • Figure 2: Digital representations of the two traffic intersections. (a) The SUMO representation of the intersection at West Univ Ave @ NW 17th Street in Gainesville, Florida. (b) The same intersection with the middle lane of the eastbound and westbound traffic removed.
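
The closed-loop unrolling described in Figure 1 (c)-(d) can be illustrated with a minimal sketch. It is not the authors' implementation: `predict_gaussian` is a hypothetical stand-in for the state encoder and the spatial and temporal attention modules, and it simply returns a constant-velocity mean with a fixed standard deviation. The history length `H = 5`, the two-dimensional `(x, y)` state, and the hard-coded initial histories (which the paper obtains from SUMO) are likewise assumptions made for illustration.

```python
import random
from collections import deque

H = 5  # history length; the caption only names it H, the value 5 is assumed


def predict_gaussian(history, neighbor_histories):
    """Hypothetical stand-in for the encoder and attention stack.

    Returns (mean, std) of a Gaussian over the actor's next (x, y) state.
    The mean is a constant-velocity extrapolation and the std is fixed;
    a trained model would also condition on neighbor_histories.
    """
    (x0, y0), (x1, y1) = history[-2], history[-1]
    mean = (2 * x1 - x0, 2 * y1 - y0)
    std = (0.05, 0.05)
    return mean, std


def rollout(initial_histories, num_steps, rng):
    """Unroll all actors in closed loop, feeding sampled states back in."""
    # Each actor keeps only its last H states, as in Figure 1 (b).
    histories = {a: deque(h, maxlen=H) for a, h in initial_histories.items()}
    trajectories = {a: [] for a in histories}
    for _ in range(num_steps):
        next_states = {}
        for actor, hist in histories.items():
            neighbors = [list(h) for b, h in histories.items() if b != actor]
            mean, std = predict_gaussian(list(hist), neighbors)
            # Sample the next state from the predicted Gaussian (Figure 1 (c)).
            next_states[actor] = tuple(
                rng.gauss(m, s) for m, s in zip(mean, std)
            )
        # Update every actor only after all samples are drawn, so each
        # prediction in a step sees the same snapshot of the scene.
        for actor, state in next_states.items():
            histories[actor].append(state)
            trajectories[actor].append(state)
    return trajectories


if __name__ == "__main__":
    # Hard-coded initial histories; in the paper these come from SUMO.
    init = {
        "veh0": [(float(i), 0.0) for i in range(H)],
        "veh1": [(0.0, float(i)) for i in range(H)],
    }
    result = rollout(init, num_steps=20, rng=random.Random(0))
    for actor, states in result.items():
        print(actor, states[-1])
```

Drawing all samples before updating any history keeps the step synchronous across actors. Because each sampled state becomes model input at the next step, sampling errors can compound over a long rollout, which is why the abstract stresses long-horizon physical consistency.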