HENet++: Hybrid Encoding and Multi-task Learning for 3D Perception and End-to-end Autonomous Driving

Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Ming-Hsuan Yang

TL;DR

HENet and HENet++ tackle the challenge of delivering high-accuracy, end-to-end multi-task 3D perception for autonomous driving under resource constraints. They introduce a Hybrid Encoding strategy that runs a large encoder on short-term high-resolution frames and a smaller encoder on long-term frames, coupled with dense BEV and sparse instance features processed through independent, task-specific BEV encoders. A dense-sparse collaboration framework and a model-merging pretraining scheme further boost multi-task accuracy, achieving state-of-the-art results on nuScenes for end-to-end perception and the lowest collision rate for end-to-end driving. The approach demonstrates practical gains in perception quality and planning reliability, enabling multimodal (camera and radar) inputs to inform end-to-end trajectory prediction and vehicle control.

Abstract

Three-dimensional feature extraction is a critical component of autonomous driving systems, where perception tasks such as 3D object detection, bird's-eye-view (BEV) semantic segmentation, and occupancy prediction serve as important constraints on 3D features. While large image encoders, high-resolution images, and long-term temporal inputs can significantly enhance feature quality and deliver remarkable performance gains, these techniques are often incompatible in both training and inference due to computational resource constraints. Moreover, different tasks favor distinct feature representations, making it difficult for a single model to perform end-to-end inference across multiple tasks while maintaining accuracy comparable to that of single-task models. To alleviate these issues, we present the HENet and HENet++ framework for multi-task 3D perception and end-to-end autonomous driving. Specifically, we propose a hybrid image encoding network that uses a large image encoder for short-term frames and a small one for long-term frames. Furthermore, our framework simultaneously extracts both dense and sparse features, providing more suitable representations for different tasks, reducing cumulative errors, and delivering more comprehensive information to the planning module. The proposed architecture maintains compatibility with various existing 3D feature extraction methods and supports multimodal inputs. HENet++ achieves state-of-the-art end-to-end multi-task 3D perception results on the nuScenes benchmark, while also attaining the lowest collision rate on the nuScenes end-to-end autonomous driving benchmark.
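The hybrid image encoding idea in the abstract can be illustrated with a small sketch: recent frames go to a large encoder at high resolution, and older frames go to a small encoder at lower resolution. Everything below is an assumption made for illustration, not the authors' implementation: the names `EncoderSpec`, `route_frames`, and `relative_cost`, the frame counts, the resolutions, and the per-pixel cost factors are all hypothetical, and the cost model (compute proportional to pixel count) is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EncoderSpec:
    """Hypothetical description of one image encoder branch."""
    name: str
    resolution: Tuple[int, int]  # (height, width) of the images it receives
    cost_per_pixel: float        # assumed relative compute per input pixel


def route_frames(num_frames: int, num_short_term: int) -> Tuple[List[int], List[int]]:
    """Split frame indices into short-term and long-term groups.

    Index 0 is the current frame; larger indices are older frames.
    """
    if not 0 <= num_short_term <= num_frames:
        raise ValueError("num_short_term must lie between 0 and num_frames")
    indices = list(range(num_frames))
    return indices[:num_short_term], indices[num_short_term:]


def relative_cost(num_frames_in_branch: int, spec: EncoderSpec) -> float:
    """Toy cost model: compute grows linearly with pixels and frame count."""
    height, width = spec.resolution
    return num_frames_in_branch * height * width * spec.cost_per_pixel


# Illustrative values only; none of these numbers come from the paper.
LARGE = EncoderSpec("large", (512, 1408), 4.0)
SMALL = EncoderSpec("small", (256, 704), 1.0)

short_term, long_term = route_frames(num_frames=9, num_short_term=2)

# Hybrid: large encoder on recent frames, small encoder on older frames.
hybrid_cost = relative_cost(len(short_term), LARGE) + relative_cost(len(long_term), SMALL)

# Baseline: large encoder at high resolution on every frame.
all_large_cost = relative_cost(len(short_term) + len(long_term), LARGE)

print("short-term frames:", short_term)
print("long-term frames:", long_term)
print("hybrid / all-large cost ratio:", hybrid_cost / all_large_cost)
```

Under these assumed numbers the hybrid split costs a fraction of running the large encoder on every frame, which is the kind of saving the abstract attributes to computational resource constraints. The actual savings depend on the real encoders and resolutions used.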

Paper Structure

This paper contains 24 sections, 11 equations, 9 figures, 12 tables, 2 algorithms.

Figures (9)

  • Figure 1: HENet++ reduces the training cost of simultaneously using high-resolution images and long-sequence temporal data via Hybrid Encoding. By integrating Hybrid Encoding, Joint Sparse and Dense Encoding, and Pretrain based on Model Merging, HENet++ achieves state-of-the-art multi-task performance while attaining the lowest end-to-end driving collision rate on nuScenes.
  • Figure 2: Overall architecture of HENet. I) Hybrid Image Encoding Network uses image encoders of varying complexity to encode long-sequence frames and short-term images, respectively. II) Temporal Feature Integration module fuses multi-frame features from the various encoders. III) Independent BEV Feature Encoding prepares separate BEV feature maps for different tasks.
  • Figure 3: Architecture of Temporal Feature Integration module. We propose the adjacent frame fusion module (AFFM) and adopt the temporal fusion strategy with temporal backward and forward processes.
  • Figure 4: Design of Independent BEV Feature Encoding. Each task decoder is provided with BEV feature maps in different grid sizes through independent adaptive feature selection and BEV encoding.
  • Figure 5: Overall architecture of HENet++. By applying hybrid encoding simultaneously to sparse foreground features and dense background voxel features, the framework enables end-to-end multi-task prediction. In addition, we introduce a model-merging-based pre-training strategy that further enhances multi-task performance.
  • ...and 4 more figures