A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition

Xinkui Zhao, Jinsong Shu, Yangyang Wu, Guanjie Cheng, Zihe Liu, Naibo Wang, Shuiguang Deng, Zhongle Xie, Jianwei Yin

TL;DR

This paper proposes a unimodal decoupled dynamic low-rank adaptation method based on modality combinations, named MCULoRA, which is a novel framework for the parameter-efficient training of incomplete multimodal learning models.

Abstract

Multimodal Emotion Recognition (MER) often encounters incomplete multimodality in practical applications due to sensor failures or privacy protection requirements. Existing methods attempt to address various incomplete multimodal scenarios by balancing the training of each modality combination through additional gradients, but they face a critical limitation: training gradients from different modality combinations conflict with each other, ultimately degrading the performance of the final prediction model. In this paper, we propose a unimodal decoupled dynamic low-rank adaptation method based on modality combinations, named MCULoRA, which is a novel framework for the parameter-efficient training of incomplete multimodal learning models. MCULoRA consists of two key modules: modality combination aware low-rank adaptation (MCLA) and dynamic parameter fine-tuning (DPFT). The MCLA module decouples the shared information from the distinct characteristics of individual modality combinations. The DPFT module adjusts the training ratio of modality combinations based on the separability of each modality's representation space, optimizing the learning efficiency across different modality combinations. Our extensive experimental evaluation on multiple benchmark datasets demonstrates that MCULoRA substantially outperforms previous incomplete multimodal learning approaches in downstream task accuracy.
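
To make the idea behind MCLA more concrete, the following is a minimal sketch, not the authors' implementation. It assumes three modalities named audio, text, and video, enumerates every non-empty modality combination, and applies a generic LoRA-style update (a shared weight plus the product of two low-rank factors) that could be kept separately for each combination. The function names, the modality names, and the toy matrices are all hypothetical.

```python
from itertools import combinations

# Assumed modality names; the abstract does not list them.
MODALITIES = ("audio", "text", "video")


def modality_combinations(modalities):
    """Return every non-empty subset of modalities, in input order."""
    combos = []
    for size in range(1, len(modalities) + 1):
        combos.extend(combinations(modalities, size))
    return combos


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def adapted_weight(shared, down, up):
    """Return shared + up @ down, a generic LoRA-style low-rank update.

    `shared` stands for weights common to all combinations, while `down`
    and `up` are rank-r factors that would belong to one combination.
    This pairing is an assumption made for illustration.
    """
    delta = matmul(up, down)
    return [[s + d for s, d in zip(s_row, d_row)]
            for s_row, d_row in zip(shared, delta)]


# Toy usage: a 2x2 shared weight with a rank-1 update for one combination.
shared = [[1.0, 0.0], [0.0, 1.0]]
up = [[1.0], [0.0]]       # shape 2 x 1
down = [[0.5, 0.5]]       # shape 1 x 2
combos = modality_combinations(MODALITIES)
weight_for_first_combo = adapted_weight(shared, down, up)
```

With three modalities there are seven non-empty combinations, so a scheme of this kind would hold seven small factor pairs next to one shared weight, which is where the parameter efficiency of a low-rank approach would come from.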

Paper Structure

This paper contains 14 sections, 11 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: (a) Existing incomplete multimodal learning methods add unimodal prediction losses to enhance characteristic information in fused representations, but suffer from persistent gradient conflicts between modality combinations. (b) Our MCULoRA approach with the MCLA module effectively decouples characteristic and common information in unimodal data, addressing the deficiency of characteristic information in joint multimodal representations.
  • Figure 2: The overall framework of MCULoRA. In the first cell at the top, MCULoRA extracts features from the original data. In the first cell at the bottom, the MCLA module then decouples the unimodal representations. In the second cell at the top, the model uses feature information from modality combination-aware adaptation to help the joint representation complete the prediction task. During training, the DPFT module in the second cell at the bottom dynamically adjusts the occurrence probability of each modality combination based on the current decoupling status of the individual modalities, thereby balancing how strongly each single modality is adapted across combinations. Here, *m* denotes the maximum number of input modalities supported by the current model, and CA denotes the cross-attention mechanism.
  • Figure 3: Ablation study on the rank number for adapter fine-tuning. We conducted experiments on the CMU-MOSEI dataset and tested the emotion recognition accuracy with the rank number of the feature adapter fine-tuning matrix ranging from 1 to 8.
  • Figure 4: Analysis of training convergence, showing how the performance of different models fluctuates across modality combinations during training.
  • Figure 5: Visualization of test cases selected from the CMU-MOSEI dataset. Our MCULoRA method, once supplemented with characteristic information, makes more effective predictions.
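
The Figure 2 caption says that DPFT adjusts how often each modality combination occurs during training. The exact rule is not given in this summary, so the sketch below is only one plausible reading. It assumes a hypothetical separability score per combination and assumes that less separable combinations should be sampled more often, using a softmax over negated scores. The score values, the temperature parameter, and the function names are illustrative assumptions.

```python
import math
import random


def combination_probabilities(separability, temperature=1.0):
    """Map per-combination separability scores to sampling probabilities.

    Assumption: a lower score means the combination needs more training,
    so it receives a higher probability (softmax over negated scores).
    """
    logits = {combo: -score / temperature for combo, score in separability.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {combo: math.exp(value - peak) for combo, value in logits.items()}
    total = sum(exps.values())
    return {combo: value / total for combo, value in exps.items()}


def sample_combination(probabilities, rng):
    """Draw one modality combination according to the given probabilities."""
    combos = list(probabilities)
    weights = [probabilities[combo] for combo in combos]
    return rng.choices(combos, weights=weights, k=1)[0]


# Toy usage with made-up scores: the audio-only case is the least separable.
scores = {("audio",): 0.2, ("text",): 0.9, ("audio", "text"): 0.9}
probs = combination_probabilities(scores)
rng = random.Random(0)  # fixed seed so the draw is repeatable
chosen = sample_combination(probs, rng)
```

In a training loop, the scores would be recomputed periodically and the chosen combination would decide which modalities are masked for the next batch. Raising the temperature flattens the distribution toward uniform sampling.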