
Learned Multimodal Compression for Autonomous Driving

Hadi Hadizadeh, Ivan V. Bajić

TL;DR

This work addresses the heavy data burden in autonomous driving with learned multimodal compression of camera and LiDAR features, designed to preserve 3D object detection performance. Building on FUTR3D, it compares three coding strategies: one jointly compresses the fused features, while the others compress one modality first and then conditionally compress the other given the decoded first modality. The codecs, ANFIC for unconditional coding and CANF for conditional coding, operate on intermediate feature maps: the fused map in the joint approach, and the per-modality maps before fusion in the conditional approaches. On nuScenes, joint coding of the fused features achieves the best rate-accuracy trade-off, with a BD-Rate improvement of about 67.7% over a baseline, and it reaches the no-compression mAP at about 21 KB per sample. These results suggest that learned multimodal coding for machines could make onboard and edge-cloud autonomous driving systems more efficient by substantially reducing the amount of data that must be transferred.

Abstract

Autonomous driving sensors generate an enormous amount of data. In this paper, we explore learned multimodal compression for autonomous driving, specifically targeted at 3D object detection. We focus on camera and LiDAR modalities and explore several coding approaches. One approach involves joint coding of fused modalities, while others involve coding one modality first, followed by conditional coding of the other modality. We evaluate the performance of these coding schemes on the nuScenes dataset. Our experimental results indicate that joint coding of fused modalities yields better results compared to the alternatives.

Paper Structure

This paper contains 6 sections, 3 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: The simplified architecture of FUTR3D. In our experiments, $\mathcal{Y}^{(1)}$ and $\mathcal{Y}^{(2)}$ are camera and LiDAR features after feature sampling, which are two single-channel feature maps of the same spatial size. $\mathcal{Z}$ is the fused modality. When using 600 queries with an embedding length of 256, these feature maps are of size $1\times 600\times 256$.
  • Figure 2: The architecture of the first coding approach (Approach 1). In this approach, the fused modality, $\mathcal{Z}$, is first encoded by the ANFIC encoder (Enc) to obtain a compressed bitstream. The bitstream is then decoded by the ANFIC decoder (Dec) to reconstruct $\hat{\mathcal{Z}}$, which is fed to the transformer decoder to perform object detection.
  • Figure 3: The architecture of the second coding approach (Approach 2). In this approach, the camera modality, $\mathcal{Y}^{(1)}$, is first encoded by the ANFIC encoder (Enc) to obtain a compressed bitstream. The bitstream is then decoded by the ANFIC decoder (Dec) to reconstruct $\hat{\mathcal{Y}}^{(1)}$. The LiDAR modality, $\mathcal{Y}^{(2)}$, is encoded by CANF conditioned on $\hat{\mathcal{Y}}^{(1)}$ to produce a second bitstream, which is then decoded to produce $\hat{\mathcal{Y}}^{(2)}$. Both $\hat{\mathcal{Y}}^{(1)}$ and $\hat{\mathcal{Y}}^{(2)}$ are then fed to the fusion model to produce $\hat{\mathcal{Z}}$.
  • Figure 4: The rate-mAP curves for various approaches.
  • Figure 5: A sample feature map for the camera modality (top), and the LiDAR modality (bottom).
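
The data flow described in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is only an illustrative sketch under loud assumptions, not the authors' implementation: uniform scalar quantization stands in for the learned ANFIC codec, residual coding stands in for CANF's conditional coding, element-wise addition stands in for FUTR3D's learned fusion, and the feature maps are random placeholders. All function names (`make_map`, `encode`, `decode`, `fuse`, `approach1`, `approach2`) are hypothetical. Only the 600 x 256 map size comes from the Figure 1 caption.

```python
import random

# Feature-map size from the Figure 1 caption (leading channel dimension of 1 dropped).
QUERIES, EMBED = 600, 256


def make_map(rows, cols, seed):
    """Random placeholder for a single-channel feature map (not real FUTR3D output)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def encode(x, step):
    """Stand-in for a learned encoder: uniform scalar quantization to integers."""
    return [[round(v / step) for v in row] for row in x]


def decode(symbols, step):
    """Stand-in for the matching decoder: rescale the integer symbols."""
    return [[s * step for s in row] for row in symbols]


def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[u - v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(y1, y2):
    """Hypothetical fusion by element-wise sum (the real fusion model is learned)."""
    return add(y1, y2)


def approach1(y1, y2, step):
    """Joint coding (Figure 2): fuse first, then code the fused map Z."""
    z = fuse(y1, y2)
    bitstream = encode(z, step)
    return decode(bitstream, step)  # Z_hat, which would go to the transformer decoder


def approach2(y1, y2, step):
    """Conditional coding (Figure 3): code camera, then LiDAR given decoded camera."""
    y1_hat = decode(encode(y1, step), step)
    # Residual coding is a simple stand-in for conditioning on y1_hat.
    residual_hat = decode(encode(sub(y2, y1_hat), step), step)
    y2_hat = add(y1_hat, residual_hat)
    return fuse(y1_hat, y2_hat)  # Z_hat produced after decoding both modalities


def max_abs_diff(a, b):
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


y1 = make_map(QUERIES, EMBED, seed=0)  # camera features (placeholder)
y2 = make_map(QUERIES, EMBED, seed=1)  # LiDAR features (placeholder)
z = fuse(y1, y2)

z_hat_joint = approach1(y1, y2, step=0.1)
z_hat_cond = approach2(y1, y2, step=0.1)

# Size of one uncompressed map, assuming 4-byte float32 storage (an assumption).
raw_bytes = QUERIES * EMBED * 4
print(raw_bytes)  # 614400
print(max_abs_diff(z, z_hat_joint), max_abs_diff(z, z_hat_cond))
```

The sketch shows only where the codec sits relative to fusion in each approach. In the joint approach a single bitstream carries $\mathcal{Z}$, while the conditional approach produces two bitstreams and fuses after decoding. Under the float32 assumption, one uncompressed 600 x 256 map occupies 614,400 bytes, which gives a sense of scale next to the roughly 21 KB per sample quoted in the TL;DR.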