
MuSACo: Multimodal Subject-Specific Selection and Adaptation for Expression Recognition with Co-Training

Muhammad Osama Zeeshan, Natacha Gillet, Alessandro Lameiras Koerich, Marco Pedersoli, Francois Bremond, Eric Granger

TL;DR

MuSACo introduces a multimodal, subject-specific MSDA framework for ER that uses co-training to selectively leverage source subjects and generate target pseudo-labels. By combining class-aware and class-agnostic alignment losses and fusing modality-specific features, MuSACo achieves robust, personalized adaptation across challenging datasets. The approach is shown to outperform unimodal MSDA and multimodal UDA baselines on BioVid, StressID, and BAH, with strong ablations validating the contribution of source selection, disentanglement, and confidence-aware learning. Its backbone-agnostic design and demonstrated health-related applicability underscore its practical impact for personalized affective computing and digital health.

Abstract

Personalized expression recognition (ER) involves adapting a machine learning model to subject-specific data for improved recognition of expressions with considerable interpersonal variability. Subject-specific ER can benefit significantly from multi-source domain adaptation (MSDA) methods, in which each domain corresponds to a specific subject, to improve model accuracy and robustness. Despite promising results, state-of-the-art MSDA approaches often overlook multimodal information or blend sources into a single domain, limiting subject diversity and failing to explicitly capture unique subject-specific characteristics. To address these limitations, we introduce MuSACo, a multimodal subject-specific selection and adaptation method for ER based on co-training. It leverages complementary information across multiple modalities and multiple source domains for subject-specific adaptation. This makes MuSACo particularly relevant for affective computing applications in digital health, such as patient-specific assessment for stress or pain, where subject-level nuances are crucial. MuSACo selects source subjects relevant to the target and generates pseudo-labels using the dominant modality for class-aware learning, in conjunction with a class-agnostic loss to learn from less confident target samples. Finally, source features from each modality are aligned, while only confident target features are combined. Experimental results on three challenging multimodal ER datasets (BioVid, StressID, and BAH) show that MuSACo outperforms UDA (blending) and state-of-the-art MSDA methods.
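
The pseudo-labeling idea in the abstract (label a target sample with the dominant modality when it is confident, and route the remaining samples to a class-agnostic loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_by_dominant_modality`, the confidence threshold `tau_pl`, and the use of the maximum softmax probability as the confidence score are all assumptions made for illustration.

```python
from math import exp


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_by_dominant_modality(logits_by_modality, tau_pl=0.8):
    """Split target samples into confident and non-confident sets.

    logits_by_modality: dict mapping a modality name to a list of
        per-sample logit lists (same number of samples per modality).
    tau_pl: hypothetical confidence threshold (not taken from the paper).

    Returns (confident, non_confident), where confident is a list of
    (sample_index, pseudo_label, dominant_modality) tuples and
    non_confident is a list of sample indices.
    """
    modalities = list(logits_by_modality)
    n_samples = len(logits_by_modality[modalities[0]])
    confident, non_confident = [], []
    for i in range(n_samples):
        best_mod, best_prob, best_label = None, -1.0, None
        for mod in modalities:
            probs = softmax(logits_by_modality[mod][i])
            p = max(probs)
            # The "dominant" modality is assumed to be the most confident one.
            if p > best_prob:
                best_mod, best_prob, best_label = mod, p, probs.index(p)
        if best_prob >= tau_pl:
            confident.append((i, best_label, best_mod))  # class-aware branch
        else:
            non_confident.append(i)  # class-agnostic branch
    return confident, non_confident


# Toy example with two modalities, two samples, and two classes.
toy_logits = {
    "visual": [[4.0, 0.0], [0.1, 0.0]],
    "physio": [[0.0, 1.0], [0.0, 0.2]],
}
confident, non_confident = split_by_dominant_modality(toy_logits, tau_pl=0.8)
print(confident)      # [(0, 0, 'visual')]
print(non_confident)  # [1]
```

In this toy run, sample 0 is labeled by the visual modality because its top probability (about 0.98) exceeds the assumed threshold, while sample 1 stays unlabeled because neither modality is confident.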

Paper Structure

This paper contains 35 sections, 17 equations, 7 figures, 22 tables, 2 algorithms.

Figures (7)

  • Figure 1: Comparison of MuSACo against unimodal MSDA and multimodal UDA (blending) ER methods for subject-based adaptation. (a) Unimodal MSDA aligns multiple source subjects to the target within a single modality, which reduces its accuracy. (b) Multimodal UDA with blending uses a single source domain or a blend of source domains for target adaptation, but does not fully exploit the diversity across multiple subjects. (c) MuSACo selects relevant sources per modality using co-training and aligns them to the target using both class-aware and class-agnostic losses before fusing modalities for the final prediction.
  • Figure 2: An overview of MuSACo, shown for the particular case of $M = 2$ modalities (visual and physiological). First, the similarity between source and target subjects is estimated by the Selection of Source Subjects module, followed by selecting the modality that gives the maximum probability score based on the threshold. Then, Domain Alignment is achieved by the Generation of Target PLs module through co-training. It computes a class-aware loss for each modality, combined with a class-agnostic loss to learn from the non-confident target samples. Finally, the Fusion Module performs Modality Alignment through feature concatenation.
  • Figure 3: Left: t-SNE visualizations. The source-only model produces indistinguishable class clusters in feature space. Sub-based MSDA [zeeshan2024subject] reduces some noise by separating the classes to some extent. MuSACo creates more separable clusters for each class. Right: Visualization of selected and non-selected source subjects for a reference target subject, with similarity scores from the visual and physiological modalities.
  • Figure 4: Selection of the source subject threshold ($\tau_{ss}$) based on accuracy. The red circle highlights the selected value, $\tau_{ss}=0.55$.
  • Figure 5: Training of disentanglement with knife-loss and identity-head.
  • ...and 2 more figures
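
The source-selection step described in the Figure 2 caption, with the threshold $\tau_{ss}=0.55$ reported in Figure 4, might be sketched as follows. This is a hedged illustration only: the captions do not say how the similarity scores are computed, so the scores here are hypothetical inputs in [0, 1], and the function name `select_source_subjects` and the rule "keep a subject when its best modality score reaches the threshold" are assumptions.

```python
TAU_SS = 0.55  # source subject threshold reported in Figure 4


def select_source_subjects(similarity, tau_ss=TAU_SS):
    """Pick source subjects that are similar enough to the target.

    similarity: dict mapping a source subject id to a dict of
        {modality_name: similarity_score}, with scores assumed in [0, 1].
    Returns a dict mapping each selected subject id to the modality
    that gave its maximum score.
    """
    selected = {}
    for subject, scores in similarity.items():
        # Modality with the highest similarity score for this subject.
        modality, score = max(scores.items(), key=lambda kv: kv[1])
        if score >= tau_ss:
            selected[subject] = modality
    return selected


# Hypothetical similarity scores for three source subjects.
toy_similarity = {
    "s1": {"visual": 0.70, "physio": 0.40},
    "s2": {"visual": 0.30, "physio": 0.50},
    "s3": {"visual": 0.50, "physio": 0.60},
}
print(select_source_subjects(toy_similarity))
# {'s1': 'visual', 's3': 'physio'}
```

Subject `s2` is dropped because neither modality reaches the threshold, which mirrors the selected versus non-selected split visualized in Figure 3 (right).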