Interpretable Concept-based Deep Learning Framework for Multimodal Human Behavior Modeling

Xinyu Li, Marwa Mahmoud

TL;DR

This work tackles the challenge of explainability in affective computing under regulatory and ethical pressures by introducing the Attention-Guided Concept Model (AGCM), a multimodal, concept-based framework that explains which concepts contributed to a prediction and where they appear in the input. AGCM integrates a transformer-based visual backbone with spatial-attention and concept-learning modules, plus an extendable multimodal fusion mechanism to co-learn from audio-visual signals. The approach achieves state-of-the-art or competitive performance on Facial Expression Recognition (FER) datasets (RAF-DB, AffectNet) and generalizes well to downstream tasks such as engagement estimation on NOXI, while providing human-interpretable explanations via activated concepts and spatial-temporal maps. The results show that combining domain-specific concept supervision with multimodal co-learning can deliver both high predictive accuracy and meaningful, actionable explanations for real-world affective computing applications, with avenues for future human-in-the-loop and fairness-focused enhancements.

Abstract

In the contemporary era of intelligent connectivity, Affective Computing (AC), which enables systems to recognize, interpret, and respond to human behavior states, has become an integral part of many AI systems. As one of the most critical components of responsible AI and trustworthiness in all human-centered systems, explainability has been a major concern in AC. In particular, the recently released EU General Data Protection Regulation requires any high-risk AI system to be sufficiently interpretable, including the biometric-based systems and emotion recognition systems widely used in the affective computing field. Existing explainable methods often trade off interpretability against performance. Most of them focus only on highlighting key network parameters without offering meaningful, domain-specific explanations to stakeholders. They also face challenges in effectively co-learning and explaining insights from multimodal data sources. To address these limitations, we propose a novel and generalizable framework, namely the Attention-Guided Concept Model (AGCM), which provides learnable conceptual explanations by identifying which concepts lead to the predictions and where they are observed. AGCM is extendable to any spatial and temporal signals through multimodal concept alignment and co-learning, empowering stakeholders with deeper insights into the model's decision-making process. We validate the efficiency of AGCM on well-established Facial Expression Recognition benchmark datasets while also demonstrating its generalizability on more complex real-world human behavior understanding applications.

Paper Structure

This paper contains 30 sections, 3 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Difference between the black-box models, current eXplainable AI (XAI), and our proposed model. (a): Black-box ML models offer no extra insight into the model prediction. (b): Map-based XAI approaches offer explanations by identifying important regions that lead to the prediction, but without any domain-specific knowledge that validates the decision-making process. (c): Our proposed framework explicitly localizes domain-specific indicators, learns their contributions during training, and incorporates multimodal concepts, thereby making predictions based on these intermediate attributes in an inherently interpretable manner.
  • Figure 2: Feature-based approaches offer inherent interpretability and are easily understood by humans, while end-to-end models deliver state-of-the-art learning capabilities. This work seeks to integrate the strengths of both methods through a concept-based framework, which achieves a balance between high explainability and robust performance. Unlike traditional features, concepts are not static values. They serve as the neural embeddings of features that are trainable within the ML framework, spontaneously quantifying the contribution of individual concepts to the task label.
  • Figure 3: Example of patch-level AU map generated using landmark detection and average interpolation.
  • Figure 4: The architecture of our proposed Attention-Guided Concept Model (AGCM) for the spatial visual modality. The model uses a transformer backbone $\varphi(\cdot)$ to convert the facial image $x$ into a patch-level representation. The Attention-Guided Concept Generator (ACG) applies spatial-channel attention with a Multi-scale Spatial Attention (MSA) block and Channel Attended Concept Mapping (CACM), which together capture attention across both spatial and feature dimensions. The MSA block focuses on spatial features at multiple scales, enhancing the model's ability to capture both fine and coarse details. For instance, the concept of the cheek region may benefit from a larger attention area compared to the eye region. Three MSA heads are used to capture diverse spatial patterns within an image, each generating a concept attention map $\hat{a}_{i}$. These maps are weighted and summed to produce the final concept attention map, which is used to compute the concept map loss during training. CACM further improves the model's focus on the most informative features along the channel dimension, ensuring robust feature selection across multiple channels. A concept probability generator $p(\cdot)$ computes the probability of each activated concept, facilitating concept supervision by showing the contribution of individual concepts to the predicted label. Notably, ACG considers both activated and inactivated concept embeddings, as the absence of certain concepts (e.g., AUs) can provide additional information about a subject's facial expression. The predicted activated concepts, $\hat{c}_{i}^{+}$, and inactivated concepts, $\hat{c}_{i}^{-}$, are weighted by their respective probabilities from $p(\cdot)$, then concatenated and passed to the one-layer fully-connected task predictor $y(\cdot)$ to generate the final task label $\hat{t}$. During loss computation, the model is optimized with the task loss, the concept probability loss, and the concept map loss associated with the spatial concept attention, so that its explanations convey not only which key concepts contribute most to the prediction but also where these concepts appear.
  • Figure 5: In the multimodal fusion stage, the pre-learned visual branch functions as a Visual Attention-Guided Concept Generator. The parameters of the Visual Attention-Guided Concept Generator are frozen to ensure reliable visual concept predictions. On the audio side, an Acoustic Concept Generator (ACG) processes the audio input, generating activated ($\hat{c}_{i}^{+}$) and inactivated ($\hat{c}_{i}^{-}$) acoustic concept embeddings via an acoustic feature extractor $G(\cdot)$. The probability of each concept's activation is computed using an acoustic concept probability generator $p(\cdot)$. The acoustic concept embeddings are concatenated with their corresponding visual concept set and passed through a sequential bottleneck layer $\hat{c}_{0}, \dots, \hat{c}_{k}$, where $k$ represents the number of samples in the sequence. For a given video clip, it is assumed that acoustic concepts are shared across all frames. A sequence-to-sequence label predictor $y(\cdot)$ is then used to capture the contextual relationships between frames to generate the final by-frame task label. Importantly, the AGCM framework is inherently extendable to other temporal modalities by adding additional branches to accommodate new data inputs, as long as the appropriate data and annotations are available.
  • ...and 10 more figures
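
The data flow described in the Figure 4 and Figure 5 captions can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. All function names are hypothetical, the softmax normalisation of the head weights is an assumption, and so is the choice to weight inactivated embeddings by $1 - p$. The backbone, attention blocks, and sequence-to-sequence predictor are omitted; only the weighting, concatenation, and one-layer prediction steps are shown.

```python
import math


def softmax(zs):
    """Numerically stable softmax over a list of floats."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_attention_maps(head_maps, head_logits):
    """Weighted sum of the per-head attention maps for one concept.

    head_maps: one flattened patch-level map per MSA head.
    head_logits: hypothetical learnable scores (softmax weighting is assumed).
    """
    weights = softmax(head_logits)
    n_patches = len(head_maps[0])
    return [sum(w * m[j] for w, m in zip(weights, head_maps))
            for j in range(n_patches)]


def mix_concepts(c_pos, c_neg, probs):
    """Weight activated embeddings by p and inactivated ones by 1 - p
    (assumption), then concatenate everything into one flat vector."""
    bottleneck = []
    for pos, neg, p in zip(c_pos, c_neg, probs):
        bottleneck.extend(p * v for v in pos)
        bottleneck.extend((1.0 - p) * v for v in neg)
    return bottleneck


def task_predictor(bottleneck, weight, bias):
    """One fully connected layer followed by softmax over task classes."""
    logits = [sum(w * x for w, x in zip(row, bottleneck)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)


def fuse_modalities(visual_per_frame, acoustic_clip):
    """Append the clip-level acoustic concepts to each frame's visual
    concepts, since acoustic concepts are shared across all frames."""
    return [frame + acoustic_clip for frame in visual_per_frame]


# Toy example: 2 concepts with 2-dimensional embeddings, 2 task classes.
c_pos = [[1.0, 2.0], [0.5, 0.5]]
c_neg = [[3.0, 4.0], [1.0, 1.0]]
probs = [0.25, 0.5]
bottleneck = mix_concepts(c_pos, c_neg, probs)
weight = [[0.1] * len(bottleneck), [-0.1] * len(bottleneck)]
print(task_predictor(bottleneck, weight, [0.0, 0.0]))
print(fuse_attention_maps([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0]))
print(fuse_modalities([[0.1, 0.2], [0.3, 0.4]], [0.9]))
```

Because the task predictor sees only the probability-weighted concept embeddings, each prediction can be traced back to the concepts that were active, which is the property the captions emphasise.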