MuSACo: Multimodal Subject-Specific Selection and Adaptation for Expression Recognition with Co-Training
Muhammad Osama Zeeshan, Natacha Gillet, Alessandro Lameiras Koerich, Marco Pedersoli, Francois Bremond, Eric Granger
TL;DR
MuSACo is a multimodal, subject-specific multi-source domain adaptation (MSDA) framework for expression recognition (ER) that uses co-training to select relevant source subjects and generate pseudo-labels for the target subject. It combines class-aware and class-agnostic alignment losses and fuses modality-specific features to adapt the model to each target subject. On BioVid, StressID, and BAH, it outperforms unimodal MSDA and multimodal unsupervised domain adaptation (UDA) baselines, and ablations support the contribution of source selection, disentanglement, and confidence-aware learning. Its backbone-agnostic design and its evaluation on health-related tasks such as stress and pain assessment make it relevant to personalized affective computing and digital health.
Abstract
Personalized expression recognition (ER) involves adapting a machine learning model to subject-specific data for improved recognition of expressions with considerable interpersonal variability. Subject-specific ER can benefit significantly from multi-source domain adaptation (MSDA) methods, in which each domain corresponds to a specific subject, to improve model accuracy and robustness. Despite promising results, state-of-the-art MSDA approaches often overlook multimodal information or blend sources into a single domain, which limits subject diversity and fails to explicitly capture unique subject-specific characteristics. To address these limitations, we introduce MuSACo, a multimodal subject-specific selection and adaptation method for ER based on co-training. It leverages complementary information across multiple modalities and multiple source domains for subject-specific adaptation. This makes MuSACo particularly relevant for affective computing applications in digital health, such as patient-specific assessment of stress or pain, where subject-level nuances are crucial. MuSACo selects source subjects relevant to the target and generates pseudo-labels using the dominant modality for class-aware learning, in conjunction with a class-agnostic loss to learn from less confident target samples. Finally, source features from each modality are aligned, while only confident target features are combined. Experimental results on three challenging multimodal ER datasets (BioVid, StressID, and BAH) show that MuSACo outperforms unsupervised domain adaptation (UDA) with blended sources and state-of-the-art MSDA methods.
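The pseudo-labeling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the dominant modality is chosen or what confidence threshold is used. The sketch assumes the dominant modality is the one with the highest mean top-class probability on the target samples and that a fixed threshold of 0.9 separates confident from less confident samples. The function names, modality names, and probabilities are hypothetical.

```python
from typing import Dict, List, Tuple

Probs = List[List[float]]  # one row of class probabilities per target sample


def dominant_modality(probs_by_modality: Dict[str, Probs]) -> str:
    """Pick the modality with the highest mean top-class probability.

    This criterion is an assumption made for illustration only.
    """
    def mean_confidence(probs: Probs) -> float:
        return sum(max(row) for row in probs) / len(probs)

    return max(probs_by_modality, key=lambda m: mean_confidence(probs_by_modality[m]))


def split_target_samples(
    probs: Probs, threshold: float = 0.9
) -> Tuple[Dict[int, int], List[int]]:
    """Split target samples by confidence.

    Returns a dict mapping confident sample indices to pseudo-labels
    (candidates for a class-aware loss) and a list of the remaining
    indices (candidates for a class-agnostic loss).
    """
    confident: Dict[int, int] = {}
    uncertain: List[int] = []
    for i, row in enumerate(probs):
        top = max(row)
        if top >= threshold:
            confident[i] = row.index(top)  # pseudo-label = most probable class
        else:
            uncertain.append(i)
    return confident, uncertain


# Toy softmax outputs for three target samples and two classes (made-up numbers).
probs_by_modality = {
    "visual": [[0.95, 0.05], [0.55, 0.45], [0.08, 0.92]],
    "physio": [[0.70, 0.30], [0.60, 0.40], [0.35, 0.65]],
}
dominant = dominant_modality(probs_by_modality)
confident, uncertain = split_target_samples(probs_by_modality[dominant], threshold=0.9)
```

In this toy setup, the confident samples would carry pseudo-labels into class-aware learning, while the uncertain ones would contribute only through a loss that does not depend on class labels.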
