AU-TTT: Vision Test-Time Training model for Facial Action Unit Detection

Bohao Xing, Kaishen Yuan, Zitong Yu, Xin Liu, Heikki Kälviäinen

TL;DR

AU-TTT addresses cross-domain generalization in Facial Action Unit detection by injecting Test-Time Training into a vision backbone designed for AU cues. It introduces forward, bidirectional, and AU RoI TTT pathways, augmented with Multi-Scale Perception, to capture both global context and fine-grained AU features, while using MDWA and WDI losses plus MSE for supervision. The method achieves competitive within-domain results and strong cross-domain generalization on DISFA and BP4D, using only ImageNet pretraining and avoiding external data like StyleGAN features. This work demonstrates that test-time adaptation, when tailored to vision tasks and AU regions, can reduce overfitting and improve robustness in AU detection scenarios with limited labeled data.

Abstract

Facial Action Unit (AU) detection is a cornerstone of objective facial expression analysis and a critical focus in affective computing. Despite its importance, AU detection faces significant challenges, such as the high cost of AU annotation and the limited availability of datasets. These constraints often cause existing methods to overfit, resulting in substantial performance degradation when they are applied across datasets. Addressing these issues is essential for improving the reliability and generalizability of AU detection methods. Moreover, many current approaches leverage Transformers for their effectiveness in long-context modeling, but they are hindered by the quadratic complexity of self-attention. Recently, Test-Time Training (TTT) layers have emerged as a promising solution for long-sequence modeling. TTT also applies self-supervised learning for iterative updates during both training and inference, offering a potential pathway to mitigate the generalization challenges inherent in AU detection tasks. In this paper, we propose AU-TTT, a novel vision backbone tailored for AU detection that incorporates bidirectional TTT blocks. Our approach introduces TTT Linear to the AU detection task and optimizes image scanning mechanisms for enhanced performance. We also design an AU-specific Region of Interest (RoI) scanning mechanism to capture fine-grained facial features critical for AU detection. Experimental results demonstrate that our method achieves competitive performance in both within-domain and cross-domain scenarios.

Paper Structure

This paper contains 10 sections, 11 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Performance (F1 score) gap between within-domain and cross-domain AU detection for DRML zhao2016deep, J$\hat{\mathrm{A}}$A-Net shao2021jaa, MEGraphAU luo2022learning, FG-Net yin2024fg, AUFormer yuan2025auformer, and AU-TTT (Ours). The within-domain performance is averaged over DISFA and BP4D, while the cross-domain performance is averaged over the BP4D-to-DISFA and DISFA-to-BP4D settings.
  • Figure 2: A: Original TTT Block. The basic building block of Transformers, originally based on self-attention, is replaced with the TTT Layer in the Transformer backbone. B: Original TTT Layer, which is driven by a self-supervised loss and updates the weights adaptively. C: Hidden state update rule. The key idea of TTT is to make the hidden state itself a model $f$ with weights $W$, and the update rule a gradient step on the self-supervised loss $\ell$. Therefore, updating the hidden state on a test sequence is equivalent to training the model $f$ at test time sun2024learning.
  • Figure 3: Top Left: The overall AU-TTT framework. Top Right: The architecture of the AU-TTT Block. Bottom Left: The original TTT Layer scanning method. Bottom Center: The bidirectional TTT scanning method. Bottom Right: The AU RoI TTT scanning method.
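
The hidden state update rule in Figure 2C and the bidirectional scan in Figure 3 can be illustrated with a minimal sketch. It assumes a linear inner model $f(x) = Wx$ and a reconstruction-style self-supervised loss $\ell = \lVert W(\theta_K x) - \theta_V x \rVert^2$, roughly in the spirit of the TTT Linear formulation in sun2024learning. The function names, the zero initialization of $W$, the fixed learning rate `lr`, and the summation used to merge the two scan directions are illustrative assumptions, not the authors' implementation. Layer normalization, residual connections, mini-batch updates, and the AU RoI scan are omitted.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply matrix m by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def ttt_linear_scan(
    tokens: List[Vector],
    theta_k: Matrix,
    theta_v: Matrix,
    theta_q: Matrix,
    lr: float = 0.1,
) -> Tuple[List[Vector], List[float]]:
    """Scan a token sequence with a simplified TTT layer (hypothetical helper).

    The hidden state is the weight matrix W of a linear model f(x) = W x.
    Each token triggers one gradient step on the self-supervised loss
    ||W (theta_k x) - theta_v x||^2, then the output is W (theta_q x).
    Assumes the three projections share the same number of rows.
    """
    dim = len(theta_k)
    W = [[0.0] * dim for _ in range(dim)]  # assumed zero initialization
    outputs: List[Vector] = []
    losses: List[float] = []
    for x in tokens:
        k = matvec(theta_k, x)  # training view
        v = matvec(theta_v, x)  # label view
        q = matvec(theta_q, x)  # test view
        err = [p - t for p, t in zip(matvec(W, k), v)]
        losses.append(sum(e * e for e in err))  # loss before the update
        # Gradient of ||W k - v||^2 with respect to W is 2 * outer(err, k).
        W = [
            [W[i][j] - lr * 2.0 * err[i] * k[j] for j in range(dim)]
            for i in range(dim)
        ]
        outputs.append(matvec(W, q))  # output uses the updated weights
    return outputs, losses


def bidirectional_scan(
    tokens: List[Vector],
    theta_k: Matrix,
    theta_v: Matrix,
    theta_q: Matrix,
    lr: float = 0.1,
) -> List[Vector]:
    """Merge a forward and a backward scan by summation (an assumed merge rule)."""
    fwd, _ = ttt_linear_scan(tokens, theta_k, theta_v, theta_q, lr)
    bwd, _ = ttt_linear_scan(tokens[::-1], theta_k, theta_v, theta_q, lr)
    bwd = bwd[::-1]  # realign backward outputs with the original token order
    return [[a + b for a, b in zip(f, r)] for f, r in zip(fwd, bwd)]


if __name__ == "__main__":
    eye = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    patch = [1.0, 0.0, 0.0, 0.0]
    _, demo_losses = ttt_linear_scan([patch] * 5, eye, eye, eye, lr=0.1)
    print(demo_losses)  # the loss shrinks as W adapts to the repeated token
```

In this toy setting the same token is presented repeatedly, so each gradient step reduces the reconstruction error on the next presentation. This is the sense in which updating the hidden state amounts to training $f$ on the test sequence itself.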