Interpretable Concept-based Deep Learning Framework for Multimodal Human Behavior Modeling
Xinyu Li, Marwa Mahmoud
TL;DR
This work addresses explainability in affective computing, a field under growing regulatory and ethical pressure, by introducing the Attention-Guided Concept Model (AGCM), a multimodal, concept-based framework that explains which concepts contributed to a prediction and where they appear in the input. AGCM integrates a transformer-based visual backbone with spatial-attention and concept-learning modules, plus an extendable multimodal fusion mechanism for co-learning from audio-visual signals. The approach achieves state-of-the-art or competitive performance on the facial expression recognition (FER) datasets RAF-DB and AffectNet and generalizes well to downstream tasks such as engagement estimation on NOXI, while providing human-interpretable explanations through activated concepts and spatio-temporal maps. The results suggest that combining domain-specific concept supervision with multimodal co-learning can deliver both high predictive accuracy and meaningful, actionable explanations for real-world affective computing applications, with human-in-the-loop and fairness-focused enhancements left as avenues for future work.
Abstract
In the contemporary era of intelligent connectivity, Affective Computing (AC), which enables systems to recognize, interpret, and respond to human behavior states, has become an integral part of many AI systems. As one of the most critical components of responsible and trustworthy AI in human-centered systems, explainability has been a major concern in AC. In particular, the recently released EU General Data Protection Regulation requires any high-risk AI system to be sufficiently interpretable, including the biometric-based and emotion recognition systems widely used in affective computing. Existing explainable methods often trade interpretability against performance. Most of them only highlight key network parameters without offering meaningful, domain-specific explanations to stakeholders. They also struggle to co-learn from multimodal data sources and to explain the insights drawn from them. To address these limitations, we propose a novel and generalizable framework, the Attention-Guided Concept Model (AGCM), which provides learnable conceptual explanations by identifying which concepts lead to the predictions and where they are observed. AGCM is extendable to any spatial and temporal signals through multimodal concept alignment and co-learning, giving stakeholders deeper insight into the model's decision-making process. We validate the efficiency of AGCM on well-established Facial Expression Recognition benchmark datasets and demonstrate its generalizability on more complex real-world human behavior understanding applications.
