Learned Multimodal Compression for Autonomous Driving
Hadi Hadizadeh, Ivan V. Bajić
TL;DR
This work tackles the heavy data burden in autonomous driving by proposing learned multimodal compression for camera and LiDAR aimed at preserving 3D object detection performance. Building on FUTR3D, it compares three coding strategies that either jointly compress fused features or conditionally compress one modality based on the other, using ANFIC and CANF codecs to operate on feature maps after modality fusion. The evaluation on nuScenes shows that joint coding of fused features achieves the best rate-accuracy trade-off, with substantial rate reductions (e.g., ~67.7% BD-Rate improvement over a baseline) and the ability to reach no-compression mAP at about 21 KB per sample, highlighting strong task-relevant information preservation. These results demonstrate the practical potential of learned multimodal, coding-for-machines approaches for efficient onboard and edge-cloud autonomous driving systems, enabling real-time decision-making with significantly reduced data transfer and computation.
Abstract
Autonomous driving sensors generate an enormous amount of data. In this paper, we explore learned multimodal compression for autonomous driving, specifically targeted at 3D object detection. We focus on camera and LiDAR modalities and explore several coding approaches. One approach jointly codes the fused modalities, while the others code one modality first and then conditionally code the other. We evaluate these coding schemes on the nuScenes dataset. Our experimental results indicate that joint coding of the fused modalities yields better results than the alternatives.
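The rate-accuracy comparison summarized above is commonly reported as a Bjontegaard delta rate (BD-Rate). The following is a minimal sketch of the standard BD-Rate calculation with mAP used as the quality axis; the text here does not specify how the reported figure was computed, so the cubic fit in the log-rate domain, the function name `bd_rate`, and all numeric values below are illustrative assumptions, not the authors' evaluation code or data.

```python
import numpy as np


def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Average rate change (percent) of the test codec relative to the
    anchor at equal quality. Negative values mean the test codec saves rate."""
    qa = np.asarray(quality_anchor, dtype=float)
    qt = np.asarray(quality_test, dtype=float)
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))

    # Fit log-rate as a cubic polynomial of quality for each curve.
    poly_a = np.polyfit(qa, log_ra, 3)
    poly_t = np.polyfit(qt, log_rt, 3)

    # Integrate both fits over the quality interval the curves share.
    lo = max(qa.min(), qt.min())
    hi = min(qa.max(), qt.max())
    int_a = np.polyint(poly_a)
    int_t = np.polyint(poly_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)

    # Convert the mean log-rate difference back to a percentage.
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0


# Made-up rate (KB per sample) and mAP (percent) points, for illustration only.
anchor_rate = [20.0, 40.0, 80.0, 160.0]
anchor_map = [55.0, 58.0, 61.0, 64.0]
test_rate = [10.0, 20.0, 40.0, 80.0]  # half the anchor rate at each mAP
test_map = [55.0, 58.0, 61.0, 64.0]

print(f"BD-Rate: {bd_rate(anchor_rate, anchor_map, test_rate, test_map):.2f}%")
```

In this toy example the test curve needs exactly half the anchor's rate at every mAP level, so the computed BD-Rate is about -50%, meaning a 50% rate saving at equal detection accuracy.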
