
FE-Adapter: Adapting Image-based Emotion Classifiers to Videos

Shreyank N Gowda, Boyan Gao, David A. Clifton

TL;DR

This work tackles the high cost of fine-tuning large pre-trained image models for video emotion recognition by introducing the FE-Adapter, a parameter-efficient cross-modality transfer learning approach. By embedding a Dynamic Dilated Conv3D–based adapter into image models and placing adapters before the multi-head self-attention layers, the method captures temporal dynamics with only a fraction of the trainable parameters (about $8\%$ per task and ~$15\times$ fewer updated parameters than prior methods). Across DFEW, FERV39k, and MAFW, FE-Adapter achieves competitive or superior accuracy while maintaining substantially lower computational and memory demands. This demonstrates the viability of cross-modality transfer learning for efficient, accurate video understanding and highlights the broader potential of adapter-based strategies in resource-constrained downstream tasks.

Abstract

Utilizing large pre-trained models for specific tasks has yielded impressive results. However, fully fine-tuning these increasingly large models is becoming prohibitively resource-intensive. This has led to a focus on more parameter-efficient transfer learning, primarily within the same modality. But this approach has limitations, particularly in video understanding where suitable pre-trained models are less common. Addressing this, our study introduces a novel cross-modality transfer learning approach from images to videos, which we call parameter-efficient image-to-video transfer learning. We present the Facial-Emotion Adapter (FE-Adapter), designed for efficient fine-tuning in video tasks. This adapter allows pre-trained image models, which traditionally lack temporal processing capabilities, to analyze dynamic video content efficiently. Notably, it uses about 15 times fewer parameters than previous methods, while improving accuracy. Our experiments in video emotion recognition demonstrate that the FE-Adapter can match or even surpass existing fine-tuning and video emotion models in both performance and efficiency. This breakthrough highlights the potential for cross-modality approaches in enhancing the capabilities of AI models, particularly in fields like video emotion analysis where the demand for efficiency and accuracy is constantly rising.

Paper Structure

This paper contains 19 sections, 2 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: A comparative analysis of various video-based models on the DFEW dataset, showing the relationship between the number of tunable parameters (in millions) and model accuracy (%). Our proposed FE-Adapter (highlighted in blue and bold) requires significantly fewer trainable parameters while outperforming recent SOTA models, including vision-language models. The size of each bubble represents the number of tunable parameters of the respective model. We compare with recent SOTA models such as IAL, EST, and DFER-CLIP, as well as older 3D-based models such as 3D-ResNet and C3D.
  • Figure 2: Adapters keep parameter updates minimal while preserving the generalization ability of the pre-trained model. Compared with (a) full fine-tuning, using (b) adapter modules significantly reduces the tunable parameter count.
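
The contrast in Figure 2 can be made concrete with some rough parameter arithmetic. The sketch below is illustrative only: it assumes a generic bottleneck adapter (linear down-projection, depthwise 3D convolution, linear up-projection) with one adapter per transformer block, and it uses hypothetical ViT-B-like sizes (12 blocks, width 768, roughly 86M backbone parameters, bottleneck 384, temporal kernel 3). None of these values or function names come from the paper, and the real FE-Adapter layout may differ.

```python
def adapter_param_count(dim, bottleneck, kernel=(3, 1, 1)):
    """Count parameters of one hypothetical bottleneck adapter.

    Assumed layout (not confirmed by the paper):
    Linear(dim -> bottleneck), depthwise Conv3d over (T, H, W),
    Linear(bottleneck -> dim), all with biases.
    """
    kt, kh, kw = kernel
    down = dim * bottleneck + bottleneck
    # Depthwise conv: one kt*kh*kw filter per channel, plus a bias per channel.
    # Dilation spaces out the kernel taps but adds no parameters.
    conv = bottleneck * kt * kh * kw + bottleneck
    up = bottleneck * dim + dim
    return down + conv + up


def trainable_fraction(backbone_params, num_layers, dim, bottleneck,
                       kernel=(3, 1, 1)):
    """Return (adapter parameters, adapter parameters / frozen backbone parameters)."""
    adapters = num_layers * adapter_param_count(dim, bottleneck, kernel)
    return adapters, adapters / backbone_params


if __name__ == "__main__":
    # Illustrative sizes only; the backbone is frozen, so only adapters train.
    adapters, fraction = trainable_fraction(
        backbone_params=86_000_000, num_layers=12, dim=768, bottleneck=384
    )
    print(f"{adapters:,} adapter parameters, {fraction:.1%} of the backbone")
```

Under these assumptions almost all of the adapter cost sits in the two linear projections, while the convolution that adds temporal modeling is nearly free. Because dilation does not change the kernel's parameter count, a dilated convolution can widen the temporal receptive field at no extra parameter cost.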