Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion

Minkyoung Cho, Yulong Cao, Jiachen Sun, Qingzhao Zhang, Marco Pavone, Jeong Joon Park, Heng Yang, Z. Morley Mao

TL;DR

This work introduces Cocoon, an object- and feature-level uncertainty-aware fusion framework, enabling fair comparison across modalities through the introduction of a feature aligner and a learnable surrogate ground truth, termed feature impression.

Abstract

An important paradigm in 3D object detection is the use of multiple modalities to enhance accuracy in both normal and challenging conditions, particularly for long-tail scenarios. To address this, recent studies have explored two directions of adaptive approaches: MoE-based adaptive fusion, which struggles with uncertainties arising from distinct object configurations, and late fusion for output-level adaptive fusion, which relies on separate detection pipelines and limits comprehensive understanding. In this work, we introduce Cocoon, an object- and feature-level uncertainty-aware fusion framework. The key innovation lies in uncertainty quantification for heterogeneous representations, enabling fair comparison across modalities through the introduction of a feature aligner and a learnable surrogate ground truth, termed feature impression. We also define a training objective to ensure that their relationship provides a valid metric for uncertainty quantification. Cocoon consistently outperforms existing static and adaptive methods in both normal and challenging conditions, including those with natural and artificial corruptions. Furthermore, we show the validity and efficacy of our uncertainty metric across diverse datasets.

Paper Structure

This paper contains 24 sections, 13 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Existing Fusion Methods and Our Approach.
  • Figure 2: Impact of fusion ratio on average confidence scores under various lighting conditions and object configurations. The black stars denote the optimal camera-to-LiDAR fusion ratios achieving the highest scores for each configuration. Object configurations are categorized based on two attributes: object size and distance to the ego agent. Object sizes are classified into three categories: small (< 2m), medium (2-4m), and large (> 4m). Similarly, distances are segmented into three ranges: near (< 20m), mid-distance (20-40m), and far (> 40m).
  • Figure 3: Cocoon Online Procedure (left) and Example Results (right): Cocoon operates on top of base model components. In the feature aligner, per-object features are aligned or projected into a common representation space. Next, uncertainty quantification is performed for each pair of features. These uncertainties are converted into weights ($\alpha$ and $\beta$) for adaptive fusion, which either amplify or attenuate the contribution of each modality’s original feature to the fused feature. The resulting fused feature is then used in the main decoder of the base model.
  • Figure 3: Accuracy Breakdown w.r.t. Distance.
  • Figure 4: Feature CP vs. Cocoon. In the offline stage with calibration data, Feature CP identifies the surrogate ground truth for each feature through iterative search. Each surrogate ground truth is derived using the real ground truth label in the output space and the decoder $g$ (serving as a classifier). However, in a multi-modal setting, each feature lacks a modality-specific $g$. To resolve this, Cocoon jointly trains the feature aligner (which projects heterogeneous features into a common representation space) and the surrogate ground truth (termed the feature impression, FI). Our proposed training objective makes each FI the geometric median of the aggregated features, which enables valid uncertainty quantification. In both cases, the nonconformity scores (i.e., distances) are collected to create a calibration set, which is used as a criterion to gauge the uncertainty of online test inputs. In the online stage with test data, while Feature CP iteratively searches for the surrogate ground truth, Cocoon saves time by projecting input features via our feature aligner $h$ and using a pre-trained FI.
  • ...and 6 more figures
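
The captions for Figures 3 and 4 describe a pipeline: distances to the feature impression serve as nonconformity scores, a calibration set of those scores gauges the uncertainty of a test input, and the uncertainties become fusion weights $\alpha$ and $\beta$. The following is a minimal sketch of that flow, not the paper's implementation. The function names, the Euclidean distance, the empirical-percentile uncertainty, and the normalized "1 minus uncertainty" weighting are all assumptions made for illustration; the captions do not specify them.

```python
import numpy as np


def nonconformity(features: np.ndarray, impression: np.ndarray) -> np.ndarray:
    """Distance from each aligned feature (one per row) to the feature impression.

    Assumption: Euclidean distance; the captions only say "distances".
    """
    return np.linalg.norm(features - impression, axis=-1)


def uncertainty(score: float, calibration_scores: np.ndarray) -> float:
    """Fraction of calibration scores strictly below the test score.

    0.0 means the test feature is closer than every calibration feature,
    1.0 means it is farther than all of them. This percentile rule is an
    assumed stand-in for the paper's uncertainty metric.
    """
    ordered = np.sort(calibration_scores)
    rank = np.searchsorted(ordered, score, side="left")
    return float(rank / len(ordered))


def fusion_weights(u_camera: float, u_lidar: float) -> tuple[float, float]:
    """Turn two uncertainties into weights (alpha, beta) that sum to 1.

    Assumption: weight is proportional to (1 - uncertainty).
    """
    c_camera, c_lidar = 1.0 - u_camera, 1.0 - u_lidar
    total = c_camera + c_lidar
    if total == 0.0:  # both modalities maximally uncertain: split evenly
        return 0.5, 0.5
    return c_camera / total, c_lidar / total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    impression = np.zeros(8)                        # hypothetical pre-trained FI
    calibration = nonconformity(rng.normal(size=(100, 8)), impression)

    camera_feature = rng.normal(scale=2.0, size=8)  # noisier modality
    lidar_feature = rng.normal(scale=0.5, size=8)   # cleaner modality
    u_cam = uncertainty(nonconformity(camera_feature, impression), calibration)
    u_lid = uncertainty(nonconformity(lidar_feature, impression), calibration)
    alpha, beta = fusion_weights(u_cam, u_lid)
    print(f"alpha={alpha:.2f}, beta={beta:.2f}")
```

In this sketch a single calibration set is shared by both modalities, which is only meaningful because the features are assumed to already live in the common representation space produced by the feature aligner.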
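
The object configurations in the Figure 2 caption can be written as a small binning helper. This is an illustrative sketch only: the function names are hypothetical, the thresholds are the ones stated in the caption, and the choice to treat the boundary values (2 m, 4 m, 20 m, 40 m) as part of the middle bin follows the caption's "2-4m" and "20-40m" ranges read inclusively. How object size is measured is not stated in the caption, so `size_m` is simply taken as given.

```python
def size_category(size_m: float) -> str:
    """Bin an object size in meters: small (< 2), medium (2-4), large (> 4)."""
    if size_m < 2.0:
        return "small"
    if size_m <= 4.0:
        return "medium"
    return "large"


def distance_category(distance_m: float) -> str:
    """Bin distance to the ego agent in meters: near (< 20), mid-distance (20-40), far (> 40)."""
    if distance_m < 20.0:
        return "near"
    if distance_m <= 40.0:
        return "mid-distance"
    return "far"


def object_configuration(size_m: float, distance_m: float) -> tuple[str, str]:
    """Combine both attributes into one of the 3 x 3 object configurations."""
    return size_category(size_m), distance_category(distance_m)
```

For example, `object_configuration(1.5, 10.0)` falls in the small, near configuration, while `object_configuration(4.5, 45.0)` falls in the large, far one.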