Layer-Wise Modality Decomposition for Interpretable Multimodal Sensor Fusion

Jaehyun Park, Konyul Park, Daehun Kim, Junseo Park, Jun Won Choi

TL;DR

Multimodal fusion in autonomous driving often yields opaque, entangled decisions across camera, radar, and LiDAR inputs. Layer-Wise Modality Decomposition (LMD) is a post-hoc, model-agnostic framework that locally linearizes each network layer to disentangle modality-specific information while preserving the original predictions. At every layer $l$ it yields an exact decomposition $F_j^l = h_{cj}^l + h_{rj}^l + h_{bj}^l$ of each feature into camera, radar, and bias components. Activations and normalizations are linearized with rules such as the ratio rule for LayerNorm and identity or uniform handling for BatchNorm, so that both the equality and separation properties hold across the entire network and modality-wise explanations are obtained without changing the architecture. Empirical results on camera-radar, camera-LiDAR, and tri-modal fusion demonstrate robust modality separation under perturbation-based metrics and competitive efficiency (two forward passes, $O(1)$ auxiliary state). Extensions to SHAP and attention-based models point to practical use in safety-critical deployment and to cross-domain applicability.
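
As a rough illustration of the equality property, the sketch below decomposes a toy two-layer camera-radar network in NumPy. It is not the authors' implementation: the toy network, the names `secant_slope`, `decompose`, and `forward`, and the choice of the bias-only (zero-input) pass as the second operating point for the ReLU linearization are all assumptions made for this example.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def secant_slope(z_full, z_ref):
    """Slope of the segment joining (z_ref, relu(z_ref)) and (z_full, relu(z_full)).

    Assumption for this sketch: z_ref is the pre-activation of a bias-only pass.
    Where the two points coincide, the slope is set to 0.
    """
    dz = z_full - z_ref
    dy = relu(z_full) - relu(z_ref)
    slope = np.zeros_like(z_full)
    np.divide(dy, dz, out=slope, where=np.abs(dz) > 1e-12)
    return slope


def forward(x_c, x_r, params):
    """Original (undecomposed) toy fusion network: linear -> ReLU -> linear."""
    W_c, W_r, b1, W2, b2 = params
    return W2 @ relu(W_c @ x_c + W_r @ x_r + b1) + b2


def decompose(x_c, x_r, params):
    """Propagate camera, radar, and bias components through the toy network."""
    W_c, W_r, b1, W2, b2 = params

    # Linear fusion layer: each input feeds its own component, the bias its own.
    h_c, h_r, h_b = W_c @ x_c, W_r @ x_r, b1.copy()

    # ReLU, linearized per unit: the modality components are scaled by the
    # secant slope, and the bias component carries the reference activation.
    s = secant_slope(h_c + h_r + h_b, h_b)
    h_c, h_r, h_b = s * h_c, s * h_r, relu(h_b)

    # Output linear layer: weights act on every component, bias joins h_b.
    return W2 @ h_c, W2 @ h_r, W2 @ h_b + b2


rng = np.random.default_rng(0)
params = (
    rng.normal(size=(5, 3)),  # W_c: camera weights (hypothetical shapes)
    rng.normal(size=(5, 4)),  # W_r: radar weights
    rng.normal(size=5),       # b1
    rng.normal(size=(2, 5)),  # W2
    rng.normal(size=2),       # b2
)
x_c, x_r = rng.normal(size=3), rng.normal(size=4)
h_c, h_r, h_b = decompose(x_c, x_r, params)
print(np.allclose(h_c + h_r + h_b, forward(x_c, x_r, params)))
```

Under these assumptions the three components sum to the original output up to floating-point error, and the camera component vanishes when the camera input is zeroed. The slopes still depend on the full input, so this toy says nothing about how the paper's separation property is defined or measured.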

Abstract

In autonomous driving, transparency in the decision-making of perception models is critical, as even a single misperception can be catastrophic. Yet with multi-sensor inputs, it is difficult to determine how each modality contributes to a prediction because sensor information becomes entangled within the fusion network. We introduce Layer-Wise Modality Decomposition (LMD), a post-hoc, model-agnostic interpretability method that disentangles modality-specific information across all layers of a pretrained fusion model. To our knowledge, LMD is the first approach to attribute the predictions of a perception model to individual input modalities in a sensor-fusion system for autonomous driving. We evaluate LMD on pretrained fusion models under camera-radar, camera-LiDAR, and camera-radar-LiDAR settings for autonomous driving. Its effectiveness is validated using structured perturbation-based metrics and modality-wise visual decompositions, demonstrating practical applicability to interpreting high-capacity multimodal architectures. Code is available at https://github.com/detxter-jvb/Layer-Wise-Modality-Decomposition.

Paper Structure

This paper contains 51 sections, 3 theorems, 53 equations, 6 figures, 6 tables.

Key Result

Proposition 1

Let $\hat{f}^{l}$ be the locally linearized activation function. This construction is equivalent to setting the diagonal entries of $\mathbf{J}^{l}_{ji}$ in the first-order Taylor expansion to the slope of the line segment joining the two operating points $\bigl(F_{j}^{\,l-1}(\mathbf{x}_{\mathrm{c}},\mathbf{x}_{\mathrm{r}}), F_{j}^{\,l}(\mathbf{x}_{\mathrm{c}},\ma

Figures (6)

  • Figure 1: Overall Process of LMD. LMD decomposes the multimodal features into modality-specific components through a two-stage process.
  • Figure 2: Post-hoc Interpretation through LMD: In the first row, a comparison between (c) and (d) shows that the model successfully detects the vehicle using radar data, as indicated by the green marker, whereas the camera-based prediction lacks confidence. Similarly, in the second row, the green marker in (d) highlights a prediction, correct or incorrect, that was made by one modality and not captured by the other. The prediction from bias in (e) exhibits a certain degree of perceptual capability. This component includes constant effects and high-order interactions that largely originate from the linearization of activation layers.
  • Figure 3: Computational Complexity and Memory Consumption (Single Forward-pass Measurement). $M$: number of modalities, $N_L$: number of layers.
  • Figure 3: Visualizations Using Sigmoid.
  • Figure 4: Visualizations of both Positive and Negative Values.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • Proposition 3