Hierarchical and Decoupled BEV Perception Learning Framework for Autonomous Driving

Yuqi Dai, Jian Sun, Shengbo Eben Li, Qing Xu, Jianqiang Wang, Lei He, Keqiang Li

TL;DR

This work tackles the long development cycles and limited reusability in BEV perception for autonomous driving by introducing a hierarchical, decoupled BEV framework with a modular library and graphical interface. It marries a Pretrain-Finetune paradigm with a Multi-Module Learning (MML) approach to train and assemble multiple perception modules, enabling rapid construction and customization of BEV models. Through a vision-centric 3D object detection prototype employing image-view extraction, SCA/GKT-based view transformation, temporal fusion, and a deformable DETR-style head, the method achieves consistent gains on nuScenes across backbone configurations, with notable improvements such as a 2.9% increase in mAP and 4.7% in NDS for certain models. The results advocate for modular, transferable perception architectures that can adapt to evolving platforms, sensor setups, and datasets, offering practical pathways for scalable deployment and continual learning in intelligent driving systems.

Abstract

Perception is essential for autonomous driving systems. Recent approaches based on bird's-eye-view (BEV) representations and deep learning have made significant progress. However, challenging issues remain in the perception algorithm development process, including lengthy development cycles, poor reusability, and complex sensor setups. To tackle these challenges, this paper proposes a novel hierarchical BEV perception paradigm, aiming to provide a library of fundamental perception modules and a user-friendly graphical interface that enable swift construction of customized models. We adopt a Pretrain-Finetune strategy to effectively utilize large-scale public datasets and streamline the development process. Moreover, we present a Multi-Module Learning (MML) approach, which enhances performance through synergistic and iterative training of multiple models. Extensive experimental results on the nuScenes dataset demonstrate that our approach yields significant improvement over the traditional training scheme.

Paper Structure (18 sections, 12 equations, 11 figures, 6 tables, 1 algorithm)

Figures (11)

  • Figure 1: An overview of our hierarchical and decoupled BEV perception scheme: i) A perception model library is formed based on multi-module joint training. ii) Perception algorithm models are constructed through drag-and-drop operations using a graphical user interface. iii) Model fine-tuning is performed based on custom data.
  • Figure 2: Training framework overview for Multi-Task Learning and Multi-Module Learning.
  • Figure 3: An overview of the proposed decoupled perception system for autonomous driving vehicles.
  • Figure 4: Architecture of temporal feature fusion modules.
  • Figure 5: Sketch of the proposed Multi-Module Learning pipeline. A 2x2 combination is used as an example to illustrate the proposed pre-training process for functional modules. First, each combination model is trained for one mini-epoch session, with a mini-epoch size of 3. After this round of training, the weights of the modules shared across models are fused and updated. Training then continues with further iterations of the same procedure. The process halts once the preset maximum number of training epochs is reached (set to 8 in the experiment), at which point the final weights of the functional modules are obtained.
  • ...and 6 more figures
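
The training loop described in the Figure 5 caption can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the authors' implementation. The module names, the `train_copy` stand-in for real training, and the per-combination targets are hypothetical. The element-wise mean used in `fuse` is an assumed fusion rule, since the caption does not specify one. The truncation of the last round, so that the total stays within the 8-epoch budget, is also an assumption. Only the mini-epoch size of 3 and the maximum of 8 epochs come from the caption.

```python
import itertools

# Hypothetical module names: the 2x2 case pairs two options for each of two slots.
SLOT_A = ["backbone_1", "backbone_2"]
SLOT_B = ["view_transform_1", "view_transform_2"]

MINI_EPOCH_SIZE = 3  # epochs per training session (from the Figure 5 caption)
MAX_EPOCHS = 8       # total epoch budget (from the Figure 5 caption)


def train_copy(weights, target, epochs, lr=0.5):
    """Toy stand-in for training: pull a copy of the weights toward a target."""
    w = list(weights)
    for _ in range(epochs):
        w = [wi + lr * (target - wi) for wi in w]
    return w


def fuse(copies):
    """Assumed fusion rule: element-wise mean over all trained copies."""
    return [sum(vals) / len(vals) for vals in zip(*copies)]


def multi_module_learning(library, combo_targets):
    """Alternate per-combination training and shared-module fusion."""
    epochs_done = 0
    rounds = 0
    while epochs_done < MAX_EPOCHS:
        # Assumption: the last session is shortened to respect the epoch budget.
        step = min(MINI_EPOCH_SIZE, MAX_EPOCHS - epochs_done)
        trained = {name: [] for name in library}
        # Train every combination model, starting from the current library weights.
        for combo in itertools.product(SLOT_A, SLOT_B):
            for name in combo:
                trained[name].append(
                    train_copy(library[name], combo_targets[combo], step)
                )
        # Each module appears in several combinations: fuse its copies and
        # write the result back so the next round starts from fused weights.
        for name, copies in trained.items():
            library[name] = fuse(copies)
        epochs_done += step
        rounds += 1
    return library, rounds


if __name__ == "__main__":
    library = {name: [0.0, 0.0] for name in SLOT_A + SLOT_B}
    targets = {c: float(i) for i, c in enumerate(itertools.product(SLOT_A, SLOT_B))}
    final, rounds = multi_module_learning(library, targets)
    print(rounds, final)
```

In a real setting, `train_copy` would be replaced by actual optimization of each assembled model, and `fuse` by whatever parameter-fusion rule the framework uses. The sketch only shows the control flow: train all combinations, fuse the shared modules, and repeat until the epoch budget is exhausted.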