AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts

Jun Yu, Zerui Zhang, Zhihong Wei, Gongpeng Zhao, Zhongpeng Cai, Yongqi Wang, Guochen Xie, Jichao Zhu, Wangyuan Zhu

TL;DR

The paper introduces an audio-visual approach to facial action unit detection that models temporal relationships with temporal convolution and fuses the two modalities with a pre-trained GPT-2 model, targeting complex emotional and behavioral expressions in real-world scenarios.

Abstract

Leveraging the synergy of audio and visual data is essential for understanding human emotions and behaviors, especially in in-the-wild settings. Traditional methods for integrating such multimodal information often fall short, leading to suboptimal results in facial action unit (AU) detection. To overcome these shortcomings, we propose a novel approach utilizing audio-visual multimodal data. This method enhances audio feature extraction by leveraging Mel Frequency Cepstral Coefficients (MFCC) and Log-Mel spectrogram features alongside a pre-trained VGGish network. Moreover, it adaptively captures fusion features across modalities by modeling their temporal relationships, and utilizes a pre-trained GPT-2 model for context-aware fusion of multimodal information. Our method notably improves the accuracy of AU detection by capturing the temporal and contextual nuances of the data, showing clear gains in complex scenarios. These findings underscore the potential of integrating temporal dynamics and contextual interpretation, paving the way for future research.
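
To make the audio features named in the abstract concrete, the following is a minimal sketch of how Log-Mel energies and MFCCs can be derived from one audio frame. It is not the authors' pipeline: the function names, the HTK-style mel formula, the unnormalised DCT-II, and the naive DFT (used only to keep the example dependency-free) are all assumptions made for illustration, and the pre-trained VGGish network is not included.

```python
import cmath
import math


def hz_to_mel(hz):
    # HTK-style mel scale (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum of one frame, bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge points: left edge, filter centres, right edge.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def log_mel(frame, n_mels, sample_rate):
    """Log of the mel-band energies of a single frame."""
    power = power_spectrum(frame)
    bank = mel_filterbank(n_mels, len(frame), sample_rate)
    # Small constant avoids log(0) for empty bands.
    return [math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
            for row in bank]


def mfcc(log_mel_energies, n_coeffs):
    """First n_coeffs coefficients of an unnormalised DCT-II of the log-mel energies."""
    m = len(log_mel_energies)
    return [sum(e * math.cos(math.pi * c * (i + 0.5) / m)
                for i, e in enumerate(log_mel_energies))
            for c in range(n_coeffs)]


# Usage: a 1 kHz tone sampled at 8 kHz, one 64-sample frame.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
log_mel_features = log_mel(frame, n_mels=8, sample_rate=sample_rate)
mfcc_features = mfcc(log_mel_features, n_coeffs=4)
```

In practice a full recording would be split into overlapping, windowed frames and processed with an FFT-based library; the per-frame vectors would then form the spectrogram-like input that a network such as VGGish consumes.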

Paper Structure

This paper contains 14 sections, 8 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: The flowchart presents a multimodal approach for detecting facial action units, employing pre-trained iResnet50 networks for initial feature extraction from video and audio, which are then refined through Temporal Convolutional Networks to capture the temporal dynamics. These features are integrated via a fine-tuned GPT-2 model before being classified by an AU detection head. The detailed submodules illustrate the internal workings of the TCN, emphasizing its dilated convolution blocks for expansive temporal feature capture, and the GPT-2 model, highlighting the transformer mechanism and fine-tuning approach that enables contextual understanding of the features.
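
The dilated convolution blocks mentioned in the Figure 1 caption can be illustrated with a small sketch. It assumes causal convolutions and a plain stack of layers without nonlinearities, normalization, or residual connections, none of which is confirmed by the caption; the function names and the dilation schedule (1, 2, 4) are hypothetical. The point is only to show how stacking dilated layers widens the temporal receptive field.

```python
def dilated_causal_conv1d(sequence, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], treating x as zero before t = 0."""
    out = []
    for t in range(len(sequence)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += weight * sequence[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input steps that can influence one output step of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def tcn_stack(sequence, kernel, dilations):
    """Apply one dilated causal convolution per entry in dilations, in order."""
    out = list(sequence)
    for d in dilations:
        out = dilated_causal_conv1d(out, kernel, d)
    return out


# Usage: feed a unit impulse through three layers with dilations 1, 2, 4.
impulse = [1.0] + [0.0] * 11
response = tcn_stack(impulse, kernel=[1.0, 1.0], dilations=[1, 2, 4])
span = receptive_field(kernel_size=2, dilations=[1, 2, 4])
```

With a kernel of size 2 and dilations doubling per layer, the impulse spreads over `span` consecutive time steps, so the receptive field grows exponentially with depth while the parameter count grows only linearly.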