Distilling Privileged Multimodal Information for Expression Recognition using Optimal Transport

Muhammad Haseeb Aslam, Muhammad Osama Zeeshan, Soufiane Belharbi, Marco Pedersoli, Alessandro Koerich, Simon Bacon, Eric Granger

TL;DR

This work addresses robust expression recognition when some training-time modalities are missing at test time, by leveraging privileged information that is available only during training. It introduces PKDOT, an entropy-regularized Optimal Transport (OT)-based structural knowledge distillation (KD) method that transfers local teacher-space structure to a student, using a T-Net to hallucinate privileged features at inference. The approach computes cosine batch similarity matrices to capture relational structure and applies OT to align teacher and student representations, focusing on top-$k$ anchor samples for sparsity. Experiments on Biovid and Affwild2 show that PKDOT outperforms state-of-the-art privileged KD baselines across varying fusion architectures and modality configurations, which suggests that it is modality- and model-agnostic and applicable to in-the-wild multimodal expression recognition (MER) tasks. The method is most useful when privileged modalities are expensive or unavailable at test time, and it keeps a simple, flexible training framework.
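
The pipeline summarized above (cosine batch similarity, top-$k$ anchors, entropy-regularized OT) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the anchor-selection rule in `select_anchors`, the squared-distance cost, the cost normalization, and the values of `k`, `eps`, and `n_iters` are all hypothetical choices made for the example, and the random arrays stand in for fused teacher embeddings and hallucinated student embeddings.

```python
import numpy as np


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between all samples in a batch."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def select_anchors(teacher_sim, k):
    """Hypothetical rule: keep the k samples most similar to the rest of the batch."""
    return np.argsort(-teacher_sim.sum(axis=1))[:k]


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan between two uniform distributions (Sinkhorn)."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    kernel = np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def structural_ot_loss(teacher_feats, student_feats, k=3, eps=0.1):
    """Assumed structural loss: OT cost between anchor-similarity descriptors."""
    sim_t = cosine_similarity_matrix(teacher_feats)
    sim_s = cosine_similarity_matrix(student_feats)
    anchors = select_anchors(sim_t, k)
    # Each sample is described by its similarity to the k anchor samples.
    desc_t, desc_s = sim_t[:, anchors], sim_s[:, anchors]
    # cost[i, j]: squared distance between teacher i and student j descriptors.
    cost = ((desc_t[:, None, :] - desc_s[None, :, :]) ** 2).sum(axis=-1)
    cost = cost / (cost.max() + 1e-12)  # keep eps on a comparable scale
    plan = sinkhorn_plan(cost, eps)
    return float((plan * cost).sum()), plan


rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 16))  # stand-in for fused teacher embeddings
student = rng.normal(size=(8, 16))  # stand-in for student embeddings
loss, plan = structural_ot_loss(teacher, student, k=3)
print(loss)
```

In a training loop, a loss of this kind would be computed per batch and added to the task loss; the sketch uses NumPy only to show the arithmetic and does not handle gradients.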

Abstract

Deep learning models for multimodal expression recognition have reached remarkable performance in controlled laboratory environments because of their ability to learn complementary and redundant semantic information. However, these models struggle in the wild, mainly because of the unavailability and quality of modalities used for training. In practice, only a subset of the training-time modalities may be available at test time. Learning with privileged information enables models to exploit data from additional modalities that are only available during training. State-of-the-art knowledge distillation (KD) methods have been proposed to distill information from multiple teacher models (each trained on a modality) to a common student model. These privileged KD methods typically utilize point-to-point matching, yet have no explicit mechanism to capture the structural information in the teacher representation space formed by introducing the privileged modality. Experiments were performed on two challenging problems: pain estimation on the Biovid dataset (ordinal classification) and arousal-valence prediction on the Affwild2 dataset (regression). Results show that our proposed method, PKDOT, can outperform state-of-the-art privileged KD methods on these problems. The diversity among modalities and fusion architectures indicates that PKDOT is modality- and model-agnostic.
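
To illustrate why point-to-point matching and structural matching are different objectives, the toy example below builds a student batch that is an exact rotation of the teacher batch. The pairwise cosine structure is then identical, yet a point-to-point loss is far from zero. The data, the use of a mean squared error for both terms, and all names are assumptions made for this sketch; none of it comes from the paper.

```python
import numpy as np


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between all samples in a batch."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    return unit @ unit.T


rng = np.random.default_rng(0)
teacher = rng.normal(size=(6, 4))

# Random orthogonal matrix from a QR decomposition; it preserves norms and angles.
rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
student = teacher @ rotation

# Point-to-point matching compares each student embedding with its teacher embedding.
point_to_point = float(np.mean((teacher - student) ** 2))

# Structural matching compares how samples relate to each other within the batch.
structural = float(
    np.mean((cosine_similarity_matrix(teacher) - cosine_similarity_matrix(student)) ** 2)
)

print(point_to_point, structural)
```

Here `structural` is zero up to floating-point error while `point_to_point` is not, so a purely point-to-point objective would penalize a student whose batch-level relations already agree with the teacher's.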

Paper Structure (21 sections, 9 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Overview of the proposed Privileged Knowledge Distillation with Optimal Transport (PKDOT) method that captures the local structure in the multimodal teacher representation. Teacher backbones process multimodal input data that is privileged (red arrow) and prevalent (black arrows), while student backbones only process prevalent modalities.
  • Figure 2: (Left) Conventional privileged KD computes point-to-point distance without considering local structure. (Right) The proposed PKDOT method captures the local structure and matches teacher and student representations by distilling the structural dark knowledge (adapted from Park2019RelationalKD).
  • Figure 3: Illustration of the proposed PKDOT method with prevalent and privileged modality backbones and fusion. The teacher network (top) is trained on both prevalent and privileged modalities, while the student network (bottom) only inputs the prevalent modality. It hallucinates the features of the privileged modality and generates student embeddings in the multimodal space. Entropy-regularized OT is used to distill the structural dark knowledge.
  • Figure 4: Illustration of different fusion mechanisms employed to obtain the joint teacher representation: (a) feature concatenation, (b) joint cross-attention [rajasekhar], (c) multimodal transformer.
  • Figure 5: Evolution of the similarity matrix over training epochs.
  • ...and 2 more figures