MulCPred: Learning Multi-modal Concepts for Explainable Pedestrian Action Prediction

Yan Feng, Alexander Carballo, Keisuke Fujii, Robin Karlsson, Ming Ding, Kazuya Takeda

TL;DR

MulCPred tackles the explainability gap in pedestrian action prediction by introducing a multi-modal concept-based framework. It maps inputs from multiple modalities into a shared set of concepts, then uses a linear aggregator to produce predictions and ante-hoc explanations, with a channel-wise recalibration module to enforce locality and a diversity loss to prevent mode collapse. The approach yields competitive accuracy on crossing and atomic-action tasks across TITAN and PIE, and gains cross-dataset generalization when unrecognizable concepts are pruned, with an extended MoRF-based faithfulness evaluation supporting interpretability. This work advances explainable AI for autonomous driving by offering interpretable, locality-aware, multi-modal predictions and demonstrating practical benefits for generalization and trustworthiness.

Abstract

Pedestrian action prediction is of great significance for many applications such as autonomous driving. However, state-of-the-art methods lack the explainability needed to make their predictions trustworthy. In this paper, a novel framework called MulCPred is proposed that explains its predictions based on multi-modal concepts represented by training samples. Previous concept-based methods have three limitations: 1) they cannot be directly applied to multi-modal cases; 2) they lack the locality needed to attend to details in the inputs; 3) they suffer from mode collapse. These limitations are addressed, respectively, through the following approaches: 1) a linear aggregator that integrates the activation results of the concepts into predictions, which associates concepts of different modalities and provides ante-hoc explanations of the relevance between the concepts and the predictions; 2) a channel-wise recalibration module that attends to local spatiotemporal regions, which gives the concepts locality; 3) a feature regularization loss that encourages the concepts to learn diverse patterns. MulCPred is evaluated on multiple datasets and tasks. Both qualitative and quantitative results demonstrate that MulCPred is a promising way to improve the explainability of pedestrian action prediction without obvious performance degradation. Furthermore, removing unrecognizable concepts from MulCPred improves the cross-dataset prediction performance, indicating that the generalizability of MulCPred can be improved further.
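
To make the role of the linear aggregator concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each modality branch has already produced a list of concept activation scores, and that the aggregator is a bias-free linear layer whose weights play the role of relevance scores. The function name `linear_aggregate`, the modality names, and the per-concept contribution breakdown are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def linear_aggregate(
    activations: Dict[str, List[float]],
    weights: List[List[float]],
) -> Tuple[List[float], List[List[float]]]:
    """Combine concept activations from all modalities into class logits.

    activations: modality name -> activation score of each concept (assumed layout).
    weights: one row per class, one weight per concept (assumed to act as
             the relevance of that concept to that class).
    Returns the logits and, for each class, the per-concept contributions.
    """
    # Concatenate the concepts of all modalities in a fixed (sorted) order.
    flat = [a for name in sorted(activations) for a in activations[name]]

    logits: List[float] = []
    contributions: List[List[float]] = []
    for class_weights in weights:
        if len(class_weights) != len(flat):
            raise ValueError("each class needs one weight per concept")
        # Because the layer is linear, each logit is exactly the sum of
        # weight * activation terms, so each term can be read as an explanation.
        contrib = [w * a for w, a in zip(class_weights, flat)]
        contributions.append(contrib)
        logits.append(sum(contrib))
    return logits, contributions


# Hypothetical example: two appearance concepts and one trajectory concept.
example_activations = {"appearance": [0.75, 0.25], "trajectory": [0.5]}
example_weights = [
    [1.0, 0.0, -1.0],   # class 0
    [-1.0, 0.0, 1.0],   # class 1
]
example_logits, example_contributions = linear_aggregate(
    example_activations, example_weights
)
```

Since no non-linearity sits between the concepts and the output in this sketch, the explanation is available at inference time without any post-hoc attribution step, which is the sense in which such a design can be called ante-hoc.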

Paper Structure (19 sections, 11 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Illustration of MulCPred. The part above the dashed line illustrates the inference process of MulCPred. The activation scores describe the relation between the inputs and the concepts, while the relevance scores describe the relation between the concepts and the prediction. The part below the dashed line illustrates how to explain the concepts.
  • Figure 2: Overall illustration of the MulCPred architecture. Our model takes multi-modal data as inputs, including spatiotemporal data such as videos or sequential data like trajectories. At each modality branch, the input is projected into several activation scores based on its similarity to a set of concepts. Activation scores from all modalities are integrated by a linear aggregator such that the prediction of the model can be explained in an ante-hoc manner.
  • Figure 3: The visualization process of concepts for spatiotemporal modalities. The feature map is first passed to a global average pooling (GAP) layer and then to an MLP, which projects it into a set of recalibration vectors. Each concept represents a prototype distribution of components in the feature. The heatmap that represents the result of the input activating one concept is obtained by a channel-wise multiplication between the feature and the corresponding recalibration vector.
  • Figure 4: Part of the visualization of the concepts learned from 3 tasks, i.e. crossing prediction in TITAN, crossing prediction in PIE and atomic action prediction in TITAN. For each concept, we visualize 3 samples in the training set with the highest activation values corresponding to these concepts. It can be seen that some concepts have learned consistent and reasonable patterns such as crosswalk, bicycle and pedestrian lower body movements, while other concepts have learned patterns that are irrelevant, such as the upper edge of the image and the black padding regions.
  • Figure 5: Visualization of all 10 concepts of appearance modality from TITAN crossing prediction task, TITAN atomic action prediction task, and PIE crossing prediction task when $\lambda_{1}=0$. All these concepts have learned the same pattern which seems to be the black padding region.
  • ...and 1 more figures
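
The pipeline described in the Figure 3 caption (GAP, then an MLP, then channel-wise multiplication) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the feature map is flattened to `C` channels by `N` spatiotemporal positions, the MLP is reduced to a single linear layer with a sigmoid, and the heatmap is formed by summing the recalibrated channels at each position. None of these choices, nor the function names, are confirmed by the text above.

```python
import math
from typing import List

Matrix = List[List[float]]


def global_average_pool(feature: Matrix) -> List[float]:
    """Average each channel over all spatiotemporal positions (C x N -> C)."""
    return [sum(channel) / len(channel) for channel in feature]


def recalibration_vectors(
    pooled: List[float],
    weights: List[Matrix],
    biases: List[List[float]],
) -> Matrix:
    """Map the pooled feature to one recalibration vector per concept.

    weights[k] is a C x C matrix and biases[k] a length-C vector for concept k.
    A single linear layer plus sigmoid stands in for the MLP (assumption).
    """
    vectors: Matrix = []
    for w_k, b_k in zip(weights, biases):
        vector = []
        for row, b in zip(w_k, b_k):
            z = sum(w * p for w, p in zip(row, pooled)) + b
            vector.append(1.0 / (1.0 + math.exp(-z)))
        vectors.append(vector)
    return vectors


def concept_heatmap(feature: Matrix, vector: List[float]) -> List[float]:
    """Scale each channel by its recalibration weight, then sum over channels.

    The channel sum is an assumption used here to obtain one value per position.
    """
    n_positions = len(feature[0])
    return [
        sum(vector[c] * feature[c][n] for c in range(len(feature)))
        for n in range(n_positions)
    ]


# Hypothetical toy input: 2 channels, 2 positions, 1 concept.
toy_feature = [[1.0, 3.0], [2.0, 2.0]]
toy_pooled = global_average_pool(toy_feature)
toy_vectors = recalibration_vectors(
    toy_pooled,
    weights=[[[0.0, 0.0], [0.0, 0.0]]],  # zero weights give sigmoid(0) = 0.5
    biases=[[0.0, 0.0]],
)
toy_heatmap = concept_heatmap(toy_feature, toy_vectors[0])
```

In this toy case both channels receive a weight of 0.5, so the heatmap is simply half the channel sum at each position. With learned weights, a concept would emphasize some channels over others, and positions where those channels respond strongly would stand out in the heatmap.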