T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving

Changsheng Lv, Mengshi Qi, Liang Liu, Huadong Ma

TL;DR

This work introduces the Traffic Topology Scene Graph ($\text{T}^2\text{SG}$) to unify lane topology and road-signal semantics for autonomous driving. It proposes TopoFormer, a one-stage transformer with a Lane Aggregation Layer that uses geometry-guided self-attention and a Counterfactual Intervention Layer to model realistic road structures, improving topology reasoning. Training optimizes node detection and edge prediction while leveraging the total indirect effect (TIE) of counterfactual attention, and inference uses standard edge predictions without TIE. Experiments on OpenLane-V2 show state-of-the-art performance in $\text{T}^2\text{SG}$ generation and substantial gains in traffic topology reasoning, with an OpenLane-V2 Score of $46.3$ on subset_A, demonstrating strong practical impact for HD map construction and downstream planning tasks.
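
The split between training with TIE and inference without it can be illustrated with a toy sketch. The details below are assumptions, not the paper's method: this section does not say what the counterfactual attention looks like, so the sketch substitutes uniform attention, and the names `attend`, `edge_logit`, and `total_indirect_effect` are hypothetical. The idea shown is only that TIE is the difference between the edge prediction under the learned attention and the prediction under the counterfactual attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(weights, values):
    """Weighted sum of value vectors under the given attention weights."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def edge_logit(feature, classifier_w):
    """Hypothetical linear edge classifier: a single dot product."""
    return sum(f * w for f, w in zip(feature, classifier_w))


def total_indirect_effect(scores, values, classifier_w):
    """Return (factual logit, TIE).

    Assumption: the counterfactual replaces the learned attention with
    uniform attention. The paper may use a different intervention.
    """
    factual = attend(softmax(scores), values)
    uniform = [1.0 / len(values)] * len(values)
    counterfactual = attend(uniform, values)
    y_factual = edge_logit(factual, classifier_w)
    y_counterfactual = edge_logit(counterfactual, classifier_w)
    return y_factual, y_factual - y_counterfactual


# Training could supervise both outputs; inference would read only the first.
logit, tie = total_indirect_effect(
    scores=[5.0, 0.0],
    values=[[1.0, 0.0], [0.0, 1.0]],
    classifier_w=[1.0, -1.0],
)
```

When the learned attention is already uniform, the TIE in this sketch is zero, which matches the intuition that attention carrying no information contributes no indirect effect.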

Abstract

Understanding traffic scenes and generating high-definition (HD) maps from them present significant challenges in autonomous driving. In this paper, we define a novel Traffic Topology Scene Graph (T2SG), a unified scene graph that explicitly models lanes, the road signals that control and guide them (e.g., right turn), and the topology relationships among them, which previous HD mapping methods have ignored. To generate T2SG, we propose TopoFormer, a novel one-stage Topology Scene Graph TransFormer with two newly designed layers. Specifically, TopoFormer incorporates a Lane Aggregation Layer (LAL) that leverages the geometric distance among lane centerlines to guide the aggregation of global information. Furthermore, we propose a Counterfactual Intervention Layer (CIL) to model reasonable road structures (e.g., intersection, straight) among lanes under counterfactual intervention. The generated T2SG then provides a more accurate and explainable description of the topological structure in traffic scenes. Experimental results demonstrate that TopoFormer outperforms existing methods on the T2SG generation task, and that the generated T2SG significantly enhances traffic topology reasoning in downstream tasks, achieving a state-of-the-art performance of 46.3 OLS on the OpenLane-V2 benchmark. We will release our source code and model.
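
As a rough illustration of the graph described in the abstract, the sketch below stores lanes as nodes (each with a road-signal label and a centerline) and topology relationships as directed, labeled edges. The class names, field names, and label strings are hypothetical and chosen only for this example; the abstract does not define a concrete schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class LaneNode:
    lane_id: int
    signal: str              # hypothetical road-signal label, e.g. "right_turn"
    centerline: List[Point]  # ordered BEV points along the lane centerline


@dataclass
class T2SG:
    nodes: Dict[int, LaneNode] = field(default_factory=dict)
    # Directed edges: (source lane, target lane) -> hypothetical relation label.
    edges: Dict[Tuple[int, int], str] = field(default_factory=dict)

    def add_lane(self, node: LaneNode) -> None:
        self.nodes[node.lane_id] = node

    def connect(self, src: int, dst: int, relation: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both lanes must be added before connecting them")
        self.edges[(src, dst)] = relation

    def successors(self, lane_id: int) -> List[int]:
        """Lanes reachable from lane_id through one outgoing edge."""
        return sorted(dst for (src, dst) in self.edges if src == lane_id)


graph = T2SG()
graph.add_lane(LaneNode(0, "straight", [(0.0, 0.0), (0.0, 10.0)]))
graph.add_lane(LaneNode(1, "right_turn", [(0.0, 10.0), (5.0, 15.0)]))
graph.connect(0, 1, "intersection")
```

Keeping the signal label on the node and the road structure on the edge mirrors the abstract's distinction between lanes guided by signals and the relationships among lanes.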

Paper Structure

This paper contains 17 sections, 17 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An example of traffic scene understanding: (a) the traffic scene; (b) a bird's-eye view (BEV) of the traffic scene; (c) the topology relationships proposed in TopoNet [li2023graph]; (d) the $\text{T}^2\text{SG}$ proposed in our work. Unlike TopoNet, $\text{T}^2\text{SG}$ models all relationships in the scene graph simultaneously.
  • Figure 2: The overview of our proposed TopoFormer. Given the input multi-view images, we employ a DETR-like detector to identify lane objects with corresponding class and centerline coordinates. Subsequently, TopoFormer infers the relationships among these objects, which, along with the objects themselves, constitute the $\text{T}^2\text{SG}$. The main components of TopoFormer include two newly designed layers: (a) the Counterfactual Intervention Layer incorporating Counterfactual Self-Attention, and (b) the Lane Aggregation Layer incorporating Geometric-guided Self-Attention. Ultimately, the output of TopoFormer is a traffic topology scene graph, encapsulating the topological relationships among lanes, and guided by various road signals associated with the lanes.
  • Figure 3: Qualitative results of the $\text{T}^2\text{SG}$ generation task and the lane topology reasoning, comparing the performance of TopoNet [li2023graph] and our proposed TopoFormer. The first row shows the multi-view inputs. The second row illustrates the results of lane detection and lane topology reasoning. The third row visualizes our defined $\text{T}^2\text{SG}$, with TopoNet's results converted to the same format for comparison. In these visualizations, green signifies correct predictions, red denotes erroneous predictions, and blue indicates missing predictions.
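
The geometry-guided self-attention mentioned in the Figure 2 caption can be sketched as ordinary scaled dot-product attention with a distance-based bias. This is a minimal sketch under stated assumptions: the distance measure (minimum point-to-point distance between two centerlines), the additive negative bias, and the `scale` parameter are all hypothetical choices for illustration, since the captions do not specify how the geometric distance enters the attention.

```python
import math


def centerline_distance(a, b):
    """Hypothetical measure: minimum distance between any two centerline points."""
    return min(math.hypot(ax - bx, ay - by) for ax, ay in a for bx, by in b)


def geometry_guided_attention(queries, keys, centerlines, scale=1.0):
    """Return attention weights in which nearby lanes receive larger weights.

    Assumption: the geometric distance is subtracted from the usual scaled
    dot-product logits before the softmax.
    """
    n = len(queries)
    dim = len(queries[0])
    weights = []
    for i in range(n):
        logits = []
        for j in range(n):
            dot = sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(dim)
            bias = -scale * centerline_distance(centerlines[i], centerlines[j])
            logits.append(dot + bias)
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Lanes 0 and 1 share an endpoint; lane 2 is 50 m away.
lanes = [
    [(0.0, 0.0), (0.0, 10.0)],
    [(0.0, 10.0), (0.0, 20.0)],
    [(50.0, 0.0), (50.0, 10.0)],
]
zeros = [[0.0, 0.0] for _ in lanes]  # identical features, so only geometry matters
attn = geometry_guided_attention(zeros, zeros, lanes)
```

With identical query and key features, lane 0 splits its attention almost evenly between itself and the adjacent lane 1, while the distant lane 2 receives close to none.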