Multi Teacher Privileged Knowledge Distillation for Multimodal Expression Recognition

Muhammad Haseeb Aslam, Marco Pedersoli, Alessandro Lameiras Koerich, Eric Granger

TL;DR

This work addresses multimodal expression recognition (MER) when some of the training modalities are missing at test time by introducing MT-PKDOT, a multi-teacher privileged knowledge distillation (PKD) framework. It aligns diverse modality-specific teachers with a joint representation via self-distillation and modality adapters, then distills relational knowledge to a student using entropy-regularized optimal transport (OT), complemented by a centroid alignment constraint. On Biovid and Affwild2, MT-PKDOT outperforms single-teacher PKD and the visual-only baseline: it improves on the visual-only baseline by 5.5% on Biovid, and by 3% and 5% for valence and arousal, respectively, on Affwild2. The method can also fall back to the joint multimodal teacher, which mitigates negative transfer. The approach demonstrates robustness and scalability across fusion architectures and modalities, highlighting the value of diverse privileged sources for real-world MER applications.

Abstract

Human emotion is a complex phenomenon conveyed and perceived through facial expressions, vocal tones, body language, and physiological signals. Multimodal emotion recognition systems can perform well because they can learn complementary and redundant semantic information from diverse sensors. In real-world scenarios, however, only a subset of the modalities employed for training may be available at test time. Learning with privileged information allows a model to exploit data from additional modalities that are only available during training. State-of-the-art (SOTA) privileged knowledge distillation (PKD) methods distill information from a teacher model (with privileged modalities) to a student model (without privileged modalities). However, such PKD methods rely on point-to-point matching and do not explicitly capture relational information. Methods that distill structural information have recently been proposed, but PKD methods based on structural similarity are primarily confined to learning from a single joint teacher representation, which limits their robustness, accuracy, and ability to learn from diverse multimodal sources. In this paper, a multi-teacher PKD (MT-PKDOT) method with self-distillation is introduced to align diverse teacher representations before distilling them to the student. MT-PKDOT employs a structural similarity KD mechanism based on regularized optimal transport (OT) for distillation. The proposed MT-PKDOT method was validated on the Affwild2 and Biovid datasets. Results indicate that the proposed method can outperform SOTA PKD methods. It improves on the visual-only baseline on Biovid by 5.5%. On the Affwild2 dataset, it improves on the visual-only baseline by 3% and 5% for valence and arousal, respectively. Allowing the student to learn from multiple diverse sources is shown to increase accuracy and to implicitly avoid negative transfer to the student model.

Paper Structure

This paper contains 25 sections, 10 equations, 7 figures, 7 tables, 2 algorithms.

Figures (7)

  • Figure 1: A comparison of PKD methods. (a) The point-to-point PKD method [pkd-aslam] is the vanilla PKD, in which each point in the student space is matched to the corresponding point in the teacher space. (b) The structural KD-based PKDOT method [aslam2024distilling] captures the relational information and distills it to the student. (c) In contrast, the proposed MT-PKDOT method creates a multi-teacher pool by aligning the backbone teachers with the joint representation through self-distillation and selecting the most confident teacher. A centroid loss is also introduced as an additional constraint to explicitly minimize the $\ell^2$ distance between the centroids of the teacher and student representations.
  • Figure 2: Illustration of the proposed MT-PKDOT method to train the student model. In the multi-teacher pool, the representations of the $n$ modality-specific teachers are aligned using self-distillation. After the most confident teacher is selected for the batch, the relational knowledge is captured using cosine similarity matrices. For similarity structure knowledge transfer, entropy-regularized OT is used to match the teacher and student distributions. A centroid loss is also used as an additional constraint to explicitly minimize the distance between the teacher and student centroids.
  • Figure 3: Illustration of the various fusion architectures employed to obtain the fused representation in the teacher space: (a) feature concatenation, (b) joint cross attention [rajasekhar], (c) multimodal transformer [waligora2024joint].
  • Figure 4: Evolution of the student similarity matrix over training epochs. (a) shows the similarity matrix of the pretrained teacher model. (b) through (e) show the student similarity matrix at 0%, 50%, 75%, and 100% of training.
  • Figure 5: Visualization of the teacher representations before (left) and after (right) alignment on the Biovid (B) dataset.
  • ...and 2 more figures
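
The distillation step described in the captions of Figures 1 and 2 (cosine similarity matrices, entropy-regularized OT between teacher and student, and a centroid constraint) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the squared-Euclidean cost between similarity-matrix rows, the uniform batch weights, the rescaling of the cost to [0, 1], and the values of `eps` and `n_iters` are all assumptions made for illustration. Teacher selection and self-distillation are omitted, since the captions do not specify how confidence is measured.

```python
import numpy as np


def cosine_similarity_matrix(feats):
    """Pairwise cosine similarity between the rows of a (batch, dim) array."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.maximum(norms, 1e-12)
    return unit @ unit.T


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan via Sinkhorn iterations.

    Assumption: both marginals are uniform over the batch.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    kernel = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)  # scale columns toward marginal b
        u = a / (kernel @ v)    # scale rows toward marginal a
    return u[:, None] * kernel * v[None, :]


def ot_similarity_loss(teacher_feats, student_feats, eps=0.1):
    """Transport cost between teacher and student similarity structures."""
    sim_t = cosine_similarity_matrix(teacher_feats)
    sim_s = cosine_similarity_matrix(student_feats)
    # cost[i, j]: squared distance between teacher row i and student row j
    # (hypothetical choice of ground cost).
    diff = sim_t[:, None, :] - sim_s[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    # Rescale to [0, 1] so exp(-cost / eps) stays well away from underflow
    # (a numerical-stability assumption, not taken from the paper).
    cost = cost / (cost.max() + 1e-12)
    plan = sinkhorn_plan(cost, eps=eps)
    return float((plan * cost).sum()), plan


def centroid_loss(teacher_feats, student_feats):
    """l2 distance between the teacher and student feature centroids."""
    gap = teacher_feats.mean(axis=0) - student_feats.mean(axis=0)
    return float(np.linalg.norm(gap))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(8, 16))  # stand-in for the selected teacher's features
    student = rng.normal(size=(8, 16))  # stand-in for the student's features
    ot_loss, _ = ot_similarity_loss(teacher, student)
    print("OT loss:", ot_loss, "centroid loss:", centroid_loss(teacher, student))
```

In a training loop, the two terms would typically be weighted and added to the task loss; the weights are hyperparameters that this sketch leaves out. A differentiable version would use an autodiff framework in place of NumPy.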