SDTagNet: Leveraging Text-Annotated Navigation Maps for Online HD Map Construction

Fabian Immel, Jan-Hendrik Pauls, Richard Fehler, Frank Bieder, Jonas Merkert, Christoph Stiller

TL;DR

SDTagNet tackles the high maintenance cost of HD maps by using widely available SD maps, especially textual annotations from OpenStreetMap, to improve online HD map construction. It combines an NLP tag embedding module (based on a compact BERT) with a point-level SD map encoder that uses orthogonal random feature identifiers, integrating these priors into the HD map decoder via cross-attention. The approach achieves improvements of up to +5.9 mAP on Argoverse 2 and +4.1 mAP on nuScenes in far-range perception, outperforming prior SD-map methods and enabling real-time deployment with modest overhead. By leveraging open, text-rich SD maps and self-supervised pretraining, SDTagNet offers a scalable path to more robust long-range map perception across diverse environments.

Abstract

Autonomous vehicles rely on detailed and accurate environmental information to operate safely. High-definition (HD) maps offer a promising solution, but their high maintenance cost poses a significant barrier to scalable deployment. This challenge is addressed by online HD map construction methods, which generate local HD maps from live sensor data. However, these methods are inherently limited by the short perception range of onboard sensors. To overcome this limitation and improve general performance, recent approaches have explored using standard-definition (SD) maps, which are significantly easier to maintain, as priors. We propose SDTagNet, the first online HD map construction method that fully utilizes the information of widely available SD maps, like OpenStreetMap, to enhance far-range detection accuracy. Our approach introduces two key innovations. First, in contrast to previous work, we incorporate not only polyline SD map data with manually selected classes, but also additional semantic information in the form of textual annotations. In this way, we enrich SD vector map tokens with NLP-derived features, eliminating the dependency on predefined specifications or exhaustive class taxonomies. Second, we introduce a point-level SD map encoder together with orthogonal element identifiers to uniformly integrate all types of map elements. Experiments on Argoverse 2 and nuScenes show that this boosts map perception performance by up to +5.9 mAP (+45%) w.r.t. map construction without priors and up to +3.2 mAP (+20%) w.r.t. previous approaches that already use SD map priors. Code is available at https://github.com/immel-f/SDTagNet
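
To make the second innovation more concrete, the following is a minimal sketch of how point-level tokens with orthogonal element identifiers could be assembled. It is not the authors' implementation: the function names, the frequency schedule of the sin/cos encoding, the identifier dimension, and the choice to concatenate the three parts are all assumptions made for illustration. The identifiers are produced by Gram-Schmidt orthonormalization of Gaussian samples, which stands in here for the orthogonal random features used in the paper.

```python
import math
import random

def sincos_encoding(x, y, num_freqs=4):
    """Sin/cos positional encoding of a 2D point (frequencies 2^k are an assumption)."""
    feats = []
    for coord in (x, y):
        for k in range(num_freqs):
            feats.append(math.sin((2 ** k) * coord))
            feats.append(math.cos((2 ** k) * coord))
    return feats

def orthogonal_identifiers(num_elements, dim, seed=0):
    """Draw Gaussian vectors and orthonormalize them with Gram-Schmidt."""
    if num_elements > dim:
        raise ValueError("at most `dim` mutually orthogonal vectors exist")
    rng = random.Random(seed)
    basis = []
    while len(basis) < num_elements:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            # Remove the component of v that lies along an earlier identifier.
            proj = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-8:
            basis.append([vi / norm for vi in v])
    return basis

def build_point_queries(elements, tag_embeddings, id_dim=8):
    """elements: one list of (x, y) points per map element;
    tag_embeddings: one vector per element (hypothetical precomputed values)."""
    ids = orthogonal_identifiers(len(elements), id_dim)
    queries = []
    for points, tag_vec, ident in zip(elements, tag_embeddings, ids):
        for (x, y) in points:
            # Assumption: the three parts are concatenated into one token.
            queries.append(sincos_encoding(x, y) + list(tag_vec) + ident)
    return queries

# A two-point polyline and a single point element, with toy tag embeddings.
elements = [[(0.0, 0.0), (1.0, 0.5)], [(2.0, 2.0)]]
tags = [[0.1, 0.2], [0.3, 0.4]]
queries = build_point_queries(elements, tags)
```

In this sketch, all points of one element carry the same identifier, while identifiers of different elements are mutually orthogonal, so an attention layer can in principle tell which points belong together regardless of whether the element is a point or a polyline.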

Paper Structure

This paper contains 34 sections, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Overview of the model architecture of SDTagNet. To fully exploit textual annotations and all element types in large public SD map databases like OpenStreetMap, SDTagNet introduces novel NLP tag embedding and SD map encoder modules. Text annotation embeddings are first computed with a BERT (Devlin et al., 2019) embedding model. They are then fused with scene-level context in an SD map encoder, which uses graph transformer-like methods to flexibly encode points, polylines and element relations. The encoded information is finally supplied to the base model via cross-attention.
  • Figure 2: Visualization of the SD map prior input data utilized by existing methods. Existing approaches are limited to rasterized images or polylines with manually defined classes. SDTagNet is the first method that can handle open-vocabulary textual annotations and diverse element types such as points, polylines, and relational information.
  • Figure 3: Example of the tag embedding contrastive pretraining objective. A positive sample is selected from tagsets that share the same semantically meaningful tags but differ in non-meaningful ones (like the street name). Negative samples are selected from all other unique tagsets. The number of negative samples in practice is much larger than depicted here to prevent unstable training.
  • Figure 4: Detailed design of the SD map encoder and its queries. Each point query is composed of the positional sin/cos encoding of the point, the respective tag embedding, and orthogonal random features (ORF; Yu et al., 2016) that function as element identifiers and can model graph edges.
  • Figure 5: Qualitative comparison of SDTagNet with PMapNet (all info.) and SMERF (all info.) on Argoverse 2 in the far range setting. Both SMERF and PMapNet fail to identify the one-way road and hallucinate a standard two-way crossing topology instead. SDTagNet can translate the information in the SD map tags to a correct one-way topology prediction.
  • ...and 5 more figures
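
The sampling rule described in the Figure 3 caption can be sketched as follows. This is an illustrative approximation only: the set of ignored keys, the helper names, the cosine-similarity InfoNCE loss, and the temperature value are assumptions, since the caption does not specify the exact loss used for pretraining.

```python
import math

# Assumption: keys treated as not semantically meaningful (e.g. the street name).
IGNORED_KEYS = {"name"}

def meaningful(tagset):
    """Reduce a tag dictionary to its semantically meaningful key/value pairs."""
    return frozenset((k, v) for k, v in tagset.items() if k not in IGNORED_KEYS)

def pick_samples(anchor, pool):
    """Positives share the meaningful tags but differ elsewhere; negatives are the rest."""
    key = meaningful(anchor)
    positives = [t for t in pool if t != anchor and meaningful(t) == key]
    negatives = [t for t in pool if meaningful(t) != key]
    return positives, negatives

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor_vec, positive_vec, negative_vecs, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [cosine(anchor_vec, positive_vec) / temperature]
    logits += [cosine(anchor_vec, n) / temperature for n in negative_vecs]
    m = max(logits)  # subtract the maximum for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

anchor = {"highway": "residential", "oneway": "yes", "name": "A Street"}
pool = [
    anchor,
    {"highway": "residential", "oneway": "yes", "name": "B Street"},
    {"highway": "footway"},
]
positives, negatives = pick_samples(anchor, pool)
```

Here the second tagset becomes the positive because it differs from the anchor only in `name`, and the footway tagset becomes a negative. The loss is small when the anchor embedding is closer to the positive than to every negative, which is the behaviour the caption describes.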