M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving
Xuesong Chen, Shaoshuai Shi, Tao Ma, Jingqiu Zhou, Simon See, Ka Chun Cheung, Hongsheng Li
TL;DR
M3Net addresses the need for full perception in autonomous driving by unifying 3D object detection, BEV map segmentation, and 3D occupancy prediction within a single multimodal framework. It introduces modality-adaptive feature integration (MAFI) to fuse LiDAR and image information and a task-oriented channel scaling (TCS) mechanism to mitigate cross-task gradient conflicts, and it supports both Transformer-based and Mamba-based decoders. The approach uses BEV-based query initialization tailored to each task and a shared BEV decoder with task-specific channels to enable efficient multi-task learning. On the nuScenes and OpenOccupancy benchmarks, M3Net delivers state-of-the-art multi-task performance, with substantial gains in occupancy mIoU and IoU and competitive improvements in detection and segmentation, validating its effectiveness and architectural flexibility for full perception in autonomous driving.
Abstract
The perception system for autonomous driving generally must handle multiple diverse sub-tasks. However, current algorithms typically tackle individual sub-tasks separately, which is inefficient when full-perception results are needed. Some multi-task learning methods try to unify multiple tasks in one model but do not resolve the conflicts that arise in multi-task learning. In this paper, we introduce M3Net, a novel multimodal and multi-task network that simultaneously tackles detection, segmentation, and 3D occupancy prediction for autonomous driving and outperforms single-task models. M3Net takes multimodal data as input and performs multiple tasks via query-token interactions. To enhance the integration of multimodal features for multi-task learning, we first propose the Modality-Adaptive Feature Integration (MAFI) module, which lets each single-modality feature predict channel-wise attention weights for the tasks on which that modality performs well. Based on the integrated features, we then develop task-specific query initialization strategies to accommodate the needs of detection/segmentation and 3D occupancy prediction. Leveraging the properly initialized queries, a shared decoder transforms queries and BEV features layer by layer, facilitating multi-task learning. Furthermore, we propose a Task-oriented Channel Scaling (TCS) module in the decoder to mitigate conflicts between the optimization objectives of different tasks. Additionally, our proposed multi-task querying and TCS module support both Transformer-based and Mamba-based decoders, demonstrating their flexibility across architectures. M3Net achieves state-of-the-art multi-task learning performance on the nuScenes benchmarks.
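As a rough illustration of the channel-scaling idea, the sketch below gates a shared feature vector with a separate per-channel weight vector for each task, so that each task can emphasize or suppress different channels of the same shared representation. This is only a minimal sketch under stated assumptions: the abstract does not specify how TCS computes its scales, so the sigmoid gate, the function name `channel_scale`, the `TASK_LOGITS` table, and all numeric values are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_scale(features, task_logits):
    """Scale each channel of a shared feature vector by a task-specific gate.

    features:    list of C floats, one shared feature vector (e.g. one BEV cell).
    task_logits: list of C floats, hypothetical learnable parameters for one task.
    """
    if len(features) != len(task_logits):
        raise ValueError("features and task_logits must have the same length")
    return [f * sigmoid(w) for f, w in zip(features, task_logits)]


# Hypothetical gate parameters for a 4-channel feature (assumed values, not from the paper).
TASK_LOGITS = {
    "detection": [4.0, 4.0, -4.0, 0.0],     # mostly keeps channels 0 and 1
    "segmentation": [-4.0, 0.0, 4.0, 4.0],  # mostly keeps channels 2 and 3
    "occupancy": [0.0, 0.0, 0.0, 0.0],      # uniform gate of 0.5 on every channel
}

shared = [1.0, 2.0, 3.0, 4.0]
scaled = {task: channel_scale(shared, logits) for task, logits in TASK_LOGITS.items()}

for task, values in scaled.items():
    print(task, [round(v, 3) for v in values])
```

In this toy setup the detection and segmentation gates pass largely disjoint channels, which is one plausible way that task-specific scaling could reduce interference between tasks sharing a decoder; the mechanism actually used by M3Net may differ.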
