Weakly Supervised Learning for Facial Behavior Analysis : A Review

R. Gnana Praveen, Patrick Cardinal, Eric Granger

TL;DR

This survey tackles the annotation bottleneck in facial affective behavior analysis (FABA) by organizing weakly supervised learning (WSL) approaches across four supervision types (inexact, incomplete, inaccurate, indirect) and two tasks (expression recognition and action unit (AU) recognition). It systematically reviews classification and regression methods under these regimes, detailing architectures, datasets, evaluation protocols, and key results, with emphasis on temporal modeling, AU interdependencies, pseudo-label refinement, and uncertainty estimation. The authors synthesize open challenges, such as few-shot learning, multimodal integration, fairness, and interpretability, and propose directions like unified, interpretable foundation models and LLM-assisted supervision to advance FABA in real-world settings. Overall, the work provides a comprehensive, structured reference for researchers aiming to develop data-efficient, robust FABA systems under weak supervision across diverse modalities and tasks.

Abstract

In recent years, facial behavior analysis has shifted from laboratory-controlled conditions to challenging in-the-wild conditions, driven by the superior performance of deep learning based approaches in many real-world applications. However, the performance of deep learning approaches depends on the amount of training data. A major problem with data acquisition is the need to annotate large amounts of training data. Labeling such data demands considerable human effort and strong domain expertise in facial expressions or action units, which is difficult to obtain in real-time environments. Moreover, the labeling process is highly vulnerable to the ambiguity of expressions or action units, especially for intensities, due to the bias induced by the domain experts. Therefore, there is an imperative need to address facial behavior analysis with weak annotations. In this paper, we provide a comprehensive review of weakly supervised learning (WSL) approaches for facial behavior analysis with both categorical and dimensional labels, along with the associated challenges and potential research directions. First, we introduce the various types of weak annotations in the context of facial behavior analysis and their corresponding challenges. We then systematically review the existing state-of-the-art approaches and provide a taxonomy of them, along with their insights and limitations. In addition, we summarize the datasets widely used in the reviewed literature, the performance of these approaches, and the evaluation principles. Finally, we discuss the remaining challenges and opportunities, along with potential research directions, for applying facial behavior analysis with weak labels in real-life situations.

Paper Structure

This paper contains 45 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Examples of facial images with AUs Martinez2016.
  • Figure 2: Examples of primary universal emotions. From left to right: neutral, happy, sad, fear, angry, surprise, and disgust. Images are taken from the Compound Facial Expressions of Emotions dataset FE.
  • Figure 3: An illustration of WSL scenarios for expression recognition in videos. (a) Inexact WSL: multiple instance learning (MIL) with sequence-level expression labels. (b) Incomplete WSL: semi-supervised learning (SSL) using partial frame-level expression labels. (c) Inaccurate WSL: learning with noisy expression labels. (d) Indirect WSL: learning with indirect expression labels. ${\mathbf{X}}_1$, ..., ${\mathbf{X}}_N$ represent the $N$ input video sequences. ${\mathbf{Y}}_1$, ..., ${\mathbf{Y}}_N$ represent the sequence-level labels, while $\widetilde{\mathbf{Y}}_1$, ..., $\widetilde{\mathbf{Y}}_N$ represent the sequence-level model predictions. ${\mathbf x}_{11}$, ${\mathbf x}_{12}$, ..., ${\mathbf x}_{1n}$ are the frames of the first sequence ${\mathbf{X}}_1$, and ${\mathbf y}_{11}$, ${\mathbf y}_{12}$, ..., ${\mathbf y}_{1n}$ represent their respective frame-level expression labels. $\widetilde{\mathbf{y}}_1$, $\widetilde{\mathbf{y}}_2$, ..., $\widetilde{\mathbf{y}}_n$ denote the frame-level predictions. $\overline{\mathbf{y}}_1$, $\overline{\mathbf{y}}_2$, ..., $\overline{\mathbf{y}}_n$ refer to the noisy frame-level expression labels. $\hat{\mathbf{Y}}_1$, ..., $\hat{\mathbf{Y}}_N$ represent the indirect labels. "$\textbf{?}$" refers to frames with no labels.
  • Figure 4: An illustration of WSL scenarios for AU recognition in images. (a) Inexact WSL: MIL with image-level annotations. ${\boldsymbol Y}_{i}$ represents the image-level expression label for image $i$. (b) Incomplete WSL: SSL with partial AU annotations. (c) Inaccurate WSL: learning with noisy AU annotations. ${\mathbf y}_1$, ${\mathbf y}_2$, ..., ${\mathbf y}_n$ denote the AU labels, $\widetilde{\mathbf{y}}_1$, $\widetilde{\mathbf{y}}_2$, ..., $\widetilde{\mathbf{y}}_n$ the AU model predictions, and $\overline{\mathbf{y}}_1$, $\overline{\mathbf{y}}_2$, ..., $\overline{\mathbf{y}}_n$ the noisy AU labels. (d) Indirect WSL: WSL from implicit expression labels. $\hat{\boldsymbol Y}_{i}$ represents the image-level expression label for image $i$. Finally, "$\textbf{?}$" refers to the case with no annotations.
  • Figure 5: Accuracy of expression classification methods with incomplete annotations on the RAF-DB (top) and FER+ (bottom) datasets. Dotted and dashed lines denote performance using ResNet-18 and WideResNet-28-2, respectively.
  • ...and 1 more figures
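
To make scenario (a) of Figure 3 concrete, the following is a minimal sketch of inexact supervision with MIL, where only a sequence-level label is available for a bag of frames. It assumes max pooling over frame-level scores, which is one common MIL choice and not necessarily the one used by any method in the review. The function names and the score values are hypothetical and chosen only for illustration.

```python
import math


def mil_max_pool(frame_scores):
    """Sequence-level score under the standard MIL assumption:
    a sequence is positive if at least one of its frames is positive."""
    return max(frame_scores)


def bag_bce_loss(frame_scores, bag_label, eps=1e-7):
    """Binary cross-entropy between the pooled prediction and the
    sequence-level label; no frame-level labels are used."""
    p = min(max(mil_max_pool(frame_scores), eps), 1.0 - eps)
    return -(bag_label * math.log(p) + (1 - bag_label) * math.log(1.0 - p))


# Hypothetical frame-level probabilities for one sequence (illustrative values).
frame_scores = [0.05, 0.10, 0.90, 0.20]

seq_pred = mil_max_pool(frame_scores)               # pooled sequence-level prediction
loss_pos = bag_bce_loss(frame_scores, bag_label=1)  # sequence labeled as containing the expression
loss_neg = bag_bce_loss(frame_scores, bag_label=0)  # sequence labeled as not containing it
```

In this sketch the loss is driven only by the highest-scoring frame, which is why max pooling tends to localize the expression to a few frames. Softer pooling operators (for example, mean or attention-weighted pooling) are alternative assumptions that spread the supervision signal across more frames.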