
Efficient Equivariant Transformer for Self-Driving Agent Modeling

Scott Xu, Dian Chen, Kelvin Wong, Chris Zhang, Kion Fallah, Raquel Urtasun

Abstract

Accurately modeling agent behaviors is an important task in self-driving. It is also a task with many symmetries, such as equivariance to the order of agents and objects in the scene, or equivariance to arbitrary roto-translations of the scene as a whole, i.e., SE(2)-equivariance. The transformer architecture is a ubiquitous tool for modeling these symmetries. While standard self-attention is inherently permutation equivariant, explicit pairwise relative positional encodings have been the standard means of introducing SE(2)-equivariance. However, this approach introduces an additional cost that is quadratic in the number of agents, limiting its scalability to larger scenes and batch sizes. In this work, we propose DriveGATr, a novel transformer-based architecture for agent modeling that achieves SE(2)-equivariance without the computational cost of existing methods. Inspired by recent advances in geometric deep learning, DriveGATr encodes scene elements as multivectors in the 2D projective geometric algebra $\mathbb{R}^*_{2,0,1}$ and processes them with a stack of equivariant transformer blocks. Crucially, DriveGATr models geometric relationships using standard attention between multivectors, eliminating the need for costly explicit pairwise relative positional encodings. Experiments on the Waymo Open Motion Dataset demonstrate that DriveGATr is comparable to the state of the art in traffic simulation and establishes a superior Pareto front for performance versus computational cost.
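
To make the cost contrast concrete, here is a minimal sketch (a hypothetical illustration, not the paper's code; all names, shapes, and the heading-difference simplification are assumptions) of the extra tensor that explicit pairwise relative positional encodings materialize, versus plain attention over tokens that, as with DriveGATr's multivector encoding, already carry their geometry:

```python
import torch

N, d = 512, 128                     # number of agents and feature width (illustrative)
poses = torch.randn(N, 3)           # per-agent (x, y, heading)

# Pairwise relative positional encodings: every query-key pair gets its own
# embedding, materializing an O(N^2 * d) tensor before attention even runs.
rel = poses[:, None, :] - poses[None, :, :]   # (N, N, 3) relative poses
rel_emb = torch.nn.Linear(3, d)(rel)          # (N, N, d) pairwise encodings

# Geometry-in-the-tokens alternative: each token already encodes its pose
# equivariantly, so attention reduces to the ordinary (N, N) score matrix.
tokens = torch.randn(N, d)
scores = tokens @ tokens.T / d**0.5           # (N, N) standard attention scores

print(rel_emb.numel(), "pairwise entries vs", scores.numel(), "scores")
```

The pairwise route pays $O(N^2 d)$ memory and compute on top of attention's usual $O(N^2)$ scores; that per-pair overhead is the additional quadratic cost the abstract refers to.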

Key Result

Proposition 1

The result of applying an operator $u$ to an object $x$ is given by the sandwich product $u[x] := u x u^{-1}$.
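
As a concrete instance (a standard geometric-algebra calculation, illustrative rather than reproduced from the paper), take the rotor $u = \cos\frac{\theta}{2} - \sin\frac{\theta}{2}\,e_{12}$, where $e_1^2 = e_2^2 = 1$ and hence $e_{12}^2 = -1$, so $u^{-1} = \cos\frac{\theta}{2} + \sin\frac{\theta}{2}\,e_{12}$. Applying the sandwich product to the vector $x = e_1$ gives

$$u[e_1] = \left(\cos\tfrac{\theta}{2} - \sin\tfrac{\theta}{2}\,e_{12}\right) e_1 \left(\cos\tfrac{\theta}{2} + \sin\tfrac{\theta}{2}\,e_{12}\right) = \cos\theta\, e_1 + \sin\theta\, e_2,$$

i.e., a rotation of $e_1$ by $\theta$. Since the same $u$ acts on both sides, the map $x \mapsto u x u^{-1}$ preserves geometric products, $u[xy] = u[x]\,u[y]$, which is why layers built from sandwich products remain equivariant.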

Figures (5)

  • Figure 1: The DriveGATr architecture. An overview panel shows the pipeline: the poses and features of $N_{actor}$ agents and $N_{map}$ map nodes in each scene are encoded as multivectors in $\mathbb{R}^*_{2,0,1}$ and scalars, and these tensors are processed by $N$ transformer blocks, each consisting of agent and map cross-attention, temporal causal self-attention, equivariant MLPs, and invariant adapters. Each of these modules contains skip connections, closely mimicking a standard transformer. A second panel details the attention blocks: the query inputs are always the agents of interest; for the cross-attentions, the key and value inputs are agents and map nodes, attending across elements per timestep, while self-attention spans the temporal axis per agent. A third panel details the equivariant MLP block. The $\mathbb{R}^*_{2,0,1}$ encodings, the individual equivariant primitive layers, and the invariant adapter block are detailed in the paper's encoding table and corresponding sections. A minimal code sketch of this block structure follows this list.
  • Figure 2: Training curves. We compare the envelope of minimal training loss per FLOP (left), compute efficiency as the number of agents scales (middle), and sample efficiency as the training dataset grows (right). Compared to the baselines, DriveGATr establishes a superior Pareto front of performance versus computational cost thanks to its compute and sample efficiency.
  • Figure 3: Robustness to roto-translations. We compare the robustness of Transformer (left), Transformer + DRoPE (middle), and DriveGATr (right) to roto-translations. In each figure, we overlay rollouts from the original coordinate frame and from one rotated by 90° and translated by 100 m forward. Blue trajectories show model predictions in the original scene; red trajectories show predictions in the transformed scene. DriveGATr produces consistent trajectories despite closed-loop execution, demonstrating its robustness to roto-translations.
  • Figure 4: Robustness to roto-translations. Additional visualizations comparing the robustness of Transformer (left), Transformer + DRoPE (middle), and DriveGATr (right) to roto-translations. In each figure, we overlay rollouts from the original coordinate frame and from one rotated by 90° and translated by 100 m forward. Blue trajectories show model predictions in the original scene; red trajectories show predictions in the transformed scene.
  • Figure 5: Visualizations of multi-modal DriveGATr rollouts. In each figure, we overlay trajectories from multiple ($k=8$) rollouts. Actors of the same color correspond to the same rollout.
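
To fix ideas, the following sketch renders the Figure 1 block structure in PyTorch. It is a minimal sketch under stated assumptions, not the authors' implementation: `DriveGATrBlock`, the adapter as a plain linear layer, and `nn.MultiheadAttention` with ordinary MLPs standing in for the paper's equivariant multivector attention and equivariant MLPs are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class DriveGATrBlock(nn.Module):
    """One block per the Figure 1 caption: agent and map cross-attention,
    temporal causal self-attention, an MLP (standing in for the equivariant
    MLP), and an adapter (standing in for the invariant adapter), each
    wrapped in a residual skip connection like a standard transformer."""

    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.agent_xattn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.map_xattn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.adapter = nn.Linear(d, d)

    def forward(self, x, agents, map_nodes):
        # x: (B, T, d) tokens of the agent of interest over T timesteps;
        # agents: (B, N_actor, d); map_nodes: (B, N_map, d). Shapes are
        # illustrative; the paper attends across elements per timestep.
        x = x + self.agent_xattn(x, agents, agents)[0]        # cross-attend agents
        x = x + self.map_xattn(x, map_nodes, map_nodes)[0]    # cross-attend map
        T = x.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        x = x + self.self_attn(x, x, x, attn_mask=causal)[0]  # temporal, causal
        x = x + self.mlp(x)                                   # equivariant-MLP slot
        return x + self.adapter(x)                            # invariant-adapter slot

# Smoke test with toy shapes: batch 2, 10 timesteps, 64 agents, 256 map nodes.
blk = DriveGATrBlock(d=128)
out = blk(torch.randn(2, 10, 128), torch.randn(2, 64, 128), torch.randn(2, 256, 128))
print(out.shape)  # torch.Size([2, 10, 128])
```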
