M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving

Xuesong Chen, Shaoshuai Shi, Tao Ma, Jingqiu Zhou, Simon See, Ka Chun Cheung, Hongsheng Li

TL;DR

M3Net addresses the need for full perception in autonomous driving by unifying 3D object detection, BEV map segmentation, and 3D occupancy prediction within a single multimodal framework. It introduces a modality-adaptive feature integration (MAFI) module to fuse LiDAR and image information and a task-oriented channel scaling (TCS) mechanism to mitigate cross-task gradient conflicts, and it supports both Transformer-based and Mamba-based decoders. The approach uses BEV-based query initialization tailored to each task, together with a shared BEV decoder with task-specific channels, to enable efficient multi-task learning. On the nuScenes and OpenOccupancy benchmarks, M3Net delivers state-of-the-art multi-task performance, with substantial gains in mIoU and IoU for occupancy and competitive improvements in detection and segmentation, supporting its effectiveness and architectural flexibility for full perception in autonomous driving.

Abstract

The perception system for autonomous driving generally needs to handle multiple diverse sub-tasks. However, current algorithms typically tackle individual sub-tasks separately, which is inefficient when full-perception results are required. Some multi-task learning methods try to unify multiple tasks in one model, but do not resolve the conflicts that arise in multi-task learning. In this paper, we introduce M3Net, a novel multimodal and multi-task network that simultaneously tackles detection, segmentation, and 3D occupancy prediction for autonomous driving and outperforms single-task models. M3Net takes multimodal data as input and performs multiple tasks via query-token interactions. To enhance the integration of multimodal features for multi-task learning, we first propose the Modality-Adaptive Feature Integration (MAFI) module, which enables each single-modality feature to predict channel-wise attention weights for the tasks on which that modality performs best. Based on the integrated features, we then develop task-specific query initialization strategies to accommodate the needs of detection/segmentation and 3D occupancy prediction. Leveraging the properly initialized queries, a shared decoder transforms queries and BEV features layer by layer, facilitating multi-task learning. Furthermore, we propose a Task-oriented Channel Scaling (TCS) module in the decoder to mitigate conflicts between the optimization objectives of different tasks. Additionally, our proposed multi-task querying and TCS module support both Transformer-based and Mamba-based decoders, demonstrating their flexibility across architectures. M3Net achieves state-of-the-art multi-task learning performance on the nuScenes benchmarks.
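
The abstract describes MAFI only at a high level: each modality predicts channel-wise attention weights that are used when integrating features. The following is a minimal sketch of one way such channel-wise gating could look, written in plain Python. It assumes a squeeze-and-excitation-style gate (global average pooling, one linear layer, sigmoid) and a simple weighted sum of LiDAR and image BEV features. The function names, the gate design, and the summation rule are illustrative assumptions, not the authors' implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_gate(feat, weight, bias):
    """Predict one attention weight per channel (hypothetical gate design).

    feat:   C channels, each a list of N BEV cell values.
    weight: C x C matrix of a single linear layer (assumed form).
    bias:   C bias values.
    """
    # Squeeze: global average pooling over BEV cells, one value per channel.
    pooled = [sum(channel) / len(channel) for channel in feat]
    # Excite: linear layer followed by a sigmoid, giving gates in (0, 1).
    return [
        sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
        for row, b in zip(weight, bias)
    ]

def fuse_modalities(lidar, image, lidar_params, image_params):
    """Fuse two BEV feature maps with per-modality channel gates.

    Each modality's gate is predicted from that modality alone; the fused
    feature is the gated sum (an assumption made for this sketch).
    """
    gate_l = channel_gate(lidar, *lidar_params)
    gate_i = channel_gate(image, *image_params)
    return [
        [gl * l + gi * i for l, i in zip(l_channel, i_channel)]
        for gl, gi, l_channel, i_channel in zip(gate_l, gate_i, lidar, image)
    ]
```

With all gate parameters set to zero, every gate equals 0.5, so the fused map reduces to the plain average of the two modalities. Non-zero parameters let each modality emphasize or suppress individual channels.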

Paper Structure

This paper contains 15 sections, 2 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: The overall architecture of our proposed M3Net as well as the detailed design of (a) Modality-adaptive Feature Integration module and (b) BEV-based multi-task query initialization.
  • Figure 2: The detailed architectures of our Transformer-based and Mamba-based decoder layers with the task-oriented channel scaling module. DSA, DCA, and VSS2D denote deformable self-attention, deformable cross-attention, and the VSS2D block from VMamba (Liu et al., 2024), respectively.
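
Figure 2 places the task-oriented channel scaling (TCS) module inside each decoder layer, but the formulation is not given in the text above. As a rough illustration only, the sketch below assumes that TCS amounts to one learnable scale vector per task, multiplied channel-wise onto shared decoder features before each task reads them. The class name, the initialization to 1.0, and the task names are hypothetical.

```python
class TaskChannelScaling:
    """Per-task channel-wise scaling of shared features (assumed form)."""

    def __init__(self, tasks, num_channels):
        # One scale vector per task, initialized to 1.0 so that the module
        # starts out as an identity mapping.
        self.scales = {task: [1.0] * num_channels for task in tasks}

    def __call__(self, task, tokens):
        """tokens: list of feature vectors, each with num_channels values."""
        scale = self.scales[task]
        return [[s * v for s, v in zip(scale, token)] for token in tokens]


# Usage: shared decoder features are rescaled differently for each task.
tcs = TaskChannelScaling(["detection", "segmentation", "occupancy"], num_channels=4)
tcs.scales["occupancy"] = [1.0, 0.0, 2.0, 1.0]  # stand-in for learned values

shared = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
det_view = tcs("detection", shared)   # unchanged, scales are still 1.0
occ_view = tcs("occupancy", shared)   # channel 1 suppressed, channel 2 doubled
```

The intent of such a design is that tasks share the decoder weights while each task can reweight the channels it relies on, which is one plausible way to reduce interference between task gradients.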