DTCLMapper: Dual Temporal Consistent Learning for Vectorized HD Map Construction

Siyu Li, Jiacheng Lin, Hao Shi, Jiaming Zhang, Song Wang, You Yao, Zhiyong Li, Kailun Yang

TL;DR

This work addresses the sparsity of visual information in vectorized HD map construction by introducing DTCLMapper, a dual temporal consistency framework. It combines Instance Consistent Learning (via VPPSM and AIFCL) and Map Consistent Learning (via MO Loss on grid maps) to preserve and propagate reliable instance information across frames without indiscriminate feature fusion. The approach achieves state-of-the-art mAP on nuScenes (61.9%) and Argoverse (65.1%), validating the effectiveness of temporal instance and map consistency for robust HD map construction. The method offers practical benefits for online autonomous driving, including improved regression efficiency and better handling of occlusion and sparsity, with potential for wider multi-frame map fusion in the future.

Abstract

Temporal information plays a pivotal role in Bird's-Eye-View (BEV) driving scene understanding because it can alleviate the sparsity of visual information. However, indiscriminate temporal fusion introduces feature redundancy when constructing vectorized High-Definition (HD) maps. In this paper, we revisit the temporal fusion of vectorized HD maps, focusing on temporal instance consistency and temporal map consistency learning. To improve the representation of instances in single-frame maps, we introduce a novel method, DTCLMapper. This approach uses a dual-stream temporal consistency learning module that combines instance embedding with geometry maps. In the instance embedding component, our approach integrates temporal Instance Consistency Learning (ICL), which enforces consistency for both the vector points and the instance features aggregated from those points. A vectorized points pre-selection module is employed to enhance the regression efficiency of vector points from each instance. The aggregated instance features obtained from this pre-selection module are then trained with contrastive learning to realize temporal consistency, where positive and negative samples are selected based on position and semantic information. The geometry mapping component introduces Map Consistency Learning (MCL), designed with self-supervised learning. MCL enhances the generalization capability of our consistent learning approach by concentrating on the global location and distribution constraints of the instances. Extensive experiments on well-recognized benchmarks indicate that the proposed DTCLMapper achieves state-of-the-art performance in vectorized mapping tasks, reaching 61.9% and 65.1% mAP scores on the nuScenes and Argoverse datasets, respectively. The source code is available at https://github.com/lynn-yu/DTCLMapper.
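
The contrastive step described in the abstract can be illustrated with a small sketch. It is not taken from the DTCLMapper code: it assumes an InfoNCE-style loss, and the field names (`feat`, `center`, `label`), the `max_dist` threshold, and the `temperature` value are hypothetical choices made only for illustration. A previous-frame instance counts as a positive when it shares the anchor's semantic class and lies close to it in position; every other instance acts as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def select_positives(anchor, candidates, max_dist=2.0):
    """Indices of candidates with the anchor's class that lie within max_dist.

    max_dist is a hypothetical threshold, not a value from the paper.
    """
    positives = []
    for i, cand in enumerate(candidates):
        same_class = cand["label"] == anchor["label"]
        close = math.dist(anchor["center"], cand["center"]) <= max_dist
        if same_class and close:
            positives.append(i)
    return positives


def info_nce(anchor, candidates, positives, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the paper's loss)."""
    if not positives:
        return 0.0  # no matching instance in the previous frame
    logits = [cosine(anchor["feat"], c["feat"]) / temperature for c in candidates]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    return -math.log(sum(exps[i] for i in positives) / sum(exps))


# Current-frame instance and three previous-frame instances (toy values).
anchor = {"feat": [1.0, 0.0], "center": (0.0, 0.0), "label": "divider"}
previous = [
    {"feat": [0.9, 0.1], "center": (0.5, 0.0), "label": "divider"},   # positive
    {"feat": [0.0, 1.0], "center": (0.4, 0.0), "label": "boundary"},  # other class
    {"feat": [1.0, 0.0], "center": (10.0, 0.0), "label": "divider"},  # too far away
]
positives = select_positives(anchor, previous)
loss = info_nce(anchor, previous, positives)
```

Minimizing such a loss pulls the anchor's feature toward its matched previous-frame instance and pushes it away from the others. That is the general effect the abstract attributes to temporal instance consistency.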

Paper Structure

This paper contains 16 sections, 23 equations, 11 figures, 10 tables, 1 algorithm.

Figures (11)

  • Figure 1: Difference between current temporal fusion and the proposed DTCLMapper. (a) depicts an example of current BEV temporal fusion, BEVerse, where intermediate BEV feature layers are enhanced through temporal fusion. (b) depicts the proposed DTCLMapper for BEV HD map construction, which enriches road information from historical instance features using a consistency learning approach.
  • Figure 2: Analysis of temporal overlap and results of different temporal methods. Successive perspective images and coarse BEV road images obtained through Inverse Perspective Mapping (IPM) are shown at the top left, and the difference images between successive frames are shown on the right. The BEV images differ only slightly over a short temporal range. The middle and bottom sub-figures compare different temporal fusion methods. In terms of point regression efficiency, map generation quality, and accuracy at three levels, the proposed temporal learning compares favorably for HD map construction.
  • Figure 3: Overview of the proposed DTCLMapper architecture. It consists of the multi-view image backbone, view transformer, BEV decoder, and multi-heads. To alleviate the sparsity of visual information, a dual temporal consistent learning module is introduced, namely, Instance Consistent Learning (ICL) and Map Consistent Learning (MCL). ICL is composed of a Vector Point PreSelection Module (VPPSM) and Aggregated Instance Feature Consistent Learning (AIFCL). MCL proposes a Map Occupancy Loss (MO Loss) based on a grid map.
  • Figure 4: Diagram of the Vectorized Points PreSelection Module (VPPSM). In simple terms, an instance map is learned from BEV features. The coarse vector points are selected in each instance map according to the key point sampling principle. The instance features and the geometric positions of these selected vector points are encoded as a point query.
  • Figure 5: Visualization on the nuScenes validation dataset. From left to right are multi-view perspective images, GT, MapTR, MapTRv2, and our work. The last two columns are grid maps rasterized from the vectorized map and grid maps merged from temporal grid maps.
  • ...and 6 more figures
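
The grid-map idea mentioned in the captions of Figures 3 and 5 can be sketched as follows. This is an illustrative sketch, not the paper's Map Occupancy Loss. It assumes the polylines of both frames are already expressed in the same ego-aligned metric frame, and the function names, the cell size, and the `1 - IoU` disagreement measure are all hypothetical.

```python
import math


def rasterize(polyline, size, cell=1.0, step=0.25):
    """Mark each grid cell touched by a polyline of metric (x, y) points."""
    grid = [[0] * size for _ in range(size)]
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        length = math.hypot(x1 - x0, y1 - y0)
        n = max(1, int(length / step))  # number of sampling intervals
        for k in range(n + 1):
            t = k / n
            col = math.floor((x0 + t * (x1 - x0)) / cell)
            row = math.floor((y0 + t * (y1 - y0)) / cell)
            if 0 <= row < size and 0 <= col < size:
                grid[row][col] = 1
    return grid


def merge(grids):
    """Merge temporal grid maps by taking the cell-wise maximum (union)."""
    return [[max(cells) for cells in zip(*rows)] for rows in zip(*grids)]


def occupancy_disagreement(a, b):
    """1 - IoU between two binary grids (an assumed measure, not MO Loss)."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return 0.0 if union == 0 else 1.0 - inter / union


# The same lane divider seen in two frames; the earlier view is shorter.
current = rasterize([(0.5, 0.5), (3.5, 0.5)], size=4)
previous = rasterize([(0.5, 0.5), (1.5, 0.5)], size=4)
merged = merge([current, previous])
gap = occupancy_disagreement(current, previous)
```

In this toy case the previous frame covers two of the four cells occupied in the current frame, so `gap` is 0.5, and the merged grid recovers the full divider.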