Lightweight Models for Emotional Analysis in Video

Quoc-Tien Nguyen, Hong-Hai Nguyen, Van-Thong Huynh

TL;DR

The paper tackles real-time emotion analysis in unconstrained environments by proposing a lightweight spatiotemporal framework that combines MobileNetV4 as an efficient visual backbone with a three-level multiscale 3D MLP-Mixer Temporal Aggregation Module (TAM) to capture temporal dynamics. The backbone is mobile-friendly and pretrained on AffectNet, and the multiscale TAM aggregates features across resolutions, enabling end-to-end sequence predictions for multiple affective tasks. Evaluations on the ABAW8 datasets show competitive performance for Valence-Arousal estimation, Action Unit detection, and Emotional Mimicry Intensity estimation, with computational cost low enough for mobile and embedded deployment. The approach balances prediction accuracy with real-time feasibility in diverse, in-the-wild conditions.

Abstract

In this study, we present an approach for efficient spatiotemporal feature extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB) blocks, serves as the backbone for extracting hierarchical feature representations from input image sequences, ensuring both computational efficiency and rich semantic encoding. To capture temporal dependencies, we introduce a three-level MLP-Mixer module, which processes spatial features at multiple resolutions while maintaining structural integrity. Experimental results from the 8th ABAW competition demonstrate the effectiveness of our approach, showing promising performance in affective behavior analysis. By integrating an efficient vision backbone with a structured temporal modeling mechanism, the proposed framework achieves a balance between computational efficiency and predictive accuracy, making it well suited for real-time applications in mobile and embedded computing environments.
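
To make the idea of a three-level MLP-Mixer aggregation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the backbone yields three feature maps of shape (T, H, W, C) at decreasing resolution, that each level is mixed with residual MLPs along the time, height, width, and channel axes, and that the levels are then spatially averaged and concatenated per frame. The function names (`mix_axis`, `mixer_level`, `temporal_aggregation`), the feature shapes, the hidden width, and the random weights are all hypothetical choices for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # normalise over the last (channel) axis; no learned scale or shift
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mix_axis(x, axis, w1, w2):
    # residual two-layer MLP along `axis`: normalise over channels,
    # move the chosen axis last, mix along it, then move it back
    y = np.moveaxis(layer_norm(x), axis, -1)
    y = gelu(y @ w1) @ w2
    return x + np.moveaxis(y, -1, axis)

def mixer_level(x, params):
    # x: (T, H, W, C); mix along time, height, width, then channels
    for axis, (w1, w2) in enumerate(params):
        x = mix_axis(x, axis, w1, w2)
    return x

def init_params(shape, hidden, rng):
    # one (w1, w2) weight pair per axis of a (T, H, W, C) tensor
    return [(rng.normal(0.0, 0.02, (n, hidden)),
             rng.normal(0.0, 0.02, (hidden, n))) for n in shape]

def temporal_aggregation(features, all_params):
    # features: list of three (T, H_l, W_l, C_l) arrays, one per level
    pooled = []
    for x, params in zip(features, all_params):
        y = mixer_level(x, params)
        pooled.append(y.mean(axis=(1, 2)))    # spatial average -> (T, C_l)
    return np.concatenate(pooled, axis=-1)    # per-frame feature (T, sum of C_l)

rng = np.random.default_rng(0)
T = 8  # frames per clip (hypothetical)
# hypothetical feature shapes, not MobileNetV4's actual stage outputs
shapes = [(T, 28, 28, 16), (T, 14, 14, 32), (T, 7, 7, 64)]
features = [rng.normal(size=s) for s in shapes]
all_params = [init_params(s, hidden=32, rng=rng) for s in shapes]
out = temporal_aggregation(features, all_params)
print(out.shape)  # (8, 112)
```

The output keeps one feature vector per frame, which matches the sequence-level prediction setting described above; a task-specific head (for example, a linear layer per frame) would follow in a full model.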

Paper Structure

This paper contains 13 sections, 5 equations, 2 tables.