HENet++: Hybrid Encoding and Multi-task Learning for 3D Perception and End-to-end Autonomous Driving
Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Ming-Hsuan Yang
TL;DR
HENet and HENet++ target accurate, end-to-end multi-task 3D perception for autonomous driving under limited computational resources. They introduce a Hybrid Encoding strategy that applies a large image encoder to short-term, high-resolution frames and a smaller encoder to long-term frames. Alongside this, the models extract both dense BEV features and sparse instance features, and use independent, task-specific BEV encoders. A dense-sparse collaboration framework and a model-merging pretraining scheme further improve multi-task accuracy, yielding state-of-the-art end-to-end multi-task perception results on nuScenes and the lowest collision rate on the nuScenes end-to-end driving benchmark. The results show gains in both perception quality and planning reliability, and the framework accepts multimodal (camera and radar) inputs for end-to-end trajectory prediction and vehicle control.
Abstract
Three-dimensional feature extraction is a critical component of autonomous driving systems, where perception tasks such as 3D object detection, bird's-eye-view (BEV) semantic segmentation, and occupancy prediction serve as important constraints on 3D features. Large image encoders, high-resolution images, and long-term temporal inputs can each significantly enhance feature quality and deliver remarkable performance gains, but computational resource constraints often make it infeasible to combine them during either training or inference. Moreover, different tasks favor distinct feature representations, making it difficult for a single model to perform end-to-end inference across multiple tasks while maintaining accuracy comparable to that of single-task models. To alleviate these issues, we present the HENet and HENet++ frameworks for multi-task 3D perception and end-to-end autonomous driving. Specifically, we propose a hybrid image encoding network that uses a large image encoder for short-term frames and a small one for long-term frames. Furthermore, our framework simultaneously extracts both dense and sparse features, providing more suitable representations for different tasks, reducing cumulative errors, and delivering more comprehensive information to the planning module. The proposed architecture maintains compatibility with various existing 3D feature extraction methods and supports multimodal inputs. HENet++ achieves state-of-the-art end-to-end multi-task 3D perception results on the nuScenes benchmark, while also attaining the lowest collision rate on the nuScenes end-to-end autonomous driving benchmark.
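To make the hybrid image encoding idea concrete, the following is a minimal sketch of how frames might be routed to the two encoders by their age. It is not the authors' implementation: the names `EncoderSpec`, `assign_encoders`, and `relative_cost`, the frame counts, the resolutions, and the per-pixel cost figures are all hypothetical values chosen for illustration, and the lower resolution for the small encoder is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EncoderSpec:
    """Hypothetical description of an image encoder (not from the paper)."""
    name: str
    resolution: Tuple[int, int]  # (height, width) of the input image
    cost_per_pixel: float        # illustrative relative compute cost


def assign_encoders(num_frames: int, short_term: int,
                    large: EncoderSpec, small: EncoderSpec) -> List[EncoderSpec]:
    """Assign an encoder to each frame; index 0 is the current frame.

    The `short_term` most recent frames use the large encoder,
    and all older (long-term) frames use the small one.
    """
    if not 0 <= short_term <= num_frames:
        raise ValueError("short_term must be between 0 and num_frames")
    return [large if age < short_term else small for age in range(num_frames)]


def relative_cost(plan: List[EncoderSpec]) -> float:
    """Sum an illustrative per-frame cost: pixel count times cost per pixel."""
    return sum(s.resolution[0] * s.resolution[1] * s.cost_per_pixel for s in plan)


# Assumed, illustrative settings only; none of these numbers come from the paper.
LARGE = EncoderSpec("large", (512, 1408), 4.0)
SMALL = EncoderSpec("small", (256, 704), 1.0)

hybrid = assign_encoders(num_frames=9, short_term=2, large=LARGE, small=SMALL)
all_large = assign_encoders(num_frames=9, short_term=9, large=LARGE, small=SMALL)

print([spec.name for spec in hybrid])
print(relative_cost(hybrid) < relative_cost(all_large))
```

Under these assumed settings, the hybrid plan spends the large encoder only on the most recent frames, which is the trade-off the abstract describes: long temporal context is retained at a fraction of the cost of running the large encoder on every frame.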
