Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition

Iosif Tsangko, Andreas Triantafyllopoulos, Adem Abdelmoula, Adria Mallol-Ragolta, Bjoern W. Schuller

TL;DR

The paper interrogates how vision-language foundation models perform zero-shot facial emotion recognition by examining visually driven proxies such as teeth visibility. Using a teeth-annotated AffectNet subset and prompt-based introspection, it reveals that GPT-4o’s predictions align strongly with interpretable facial cues (notably eyebrow position and perceived teeth), achieving the highest performance but exhibiting shortcut biases when teeth cues are unreliable. The study quantifies how much of the affective space is explained by these cues (R^2 ≈ 0.72 for valence, 0.77 for arousal) and demonstrates how false positives for teeth can distort predictions toward happiness. These findings underscore the emergent but potentially brittle nature of FM-based FER, highlighting risks for bias and fairness and the need for transparent evaluation and demographic validation in affective AI applications.
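The R^2 figures above state what share of the variance in predicted valence or arousal is accounted for by the facial cues. The summary does not describe the paper's regression setup, so the following is only a minimal sketch under a simplifying assumption: a single categorical cue, for which the least-squares fit reduces to predicting each cue level's mean. The function name, cue labels, and numbers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def r_squared_from_cue(cue_values, targets):
    """R^2 of a least-squares fit that predicts the target from one categorical cue.

    With a single one-hot-encoded cue, the least-squares prediction for each
    sample is the mean target of its cue level.
    """
    by_cue = defaultdict(list)
    for cue, y in zip(cue_values, targets):
        by_cue[cue].append(y)
    level_means = {cue: sum(ys) / len(ys) for cue, ys in by_cue.items()}

    grand_mean = sum(targets) / len(targets)
    ss_total = sum((y - grand_mean) ** 2 for y in targets)
    if ss_total == 0:
        raise ValueError("targets have zero variance; R^2 is undefined")
    ss_residual = sum(
        (y - level_means[cue]) ** 2 for cue, y in zip(cue_values, targets)
    )
    return 1.0 - ss_residual / ss_total


# Hypothetical toy data: eyebrow position vs. predicted valence.
eyebrows = ["raised", "raised", "lowered", "lowered"]
valence = [0.6, 0.4, -0.4, -0.6]
r2 = r_squared_from_cue(eyebrows, valence)
```

A value close to 1 would mean the cue alone nearly determines the predicted score; several cues at once would need a full multiple regression rather than this one-cue shortcut.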

Abstract

Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero-shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark VLMs of varying scale on a teeth-annotated subset of the AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth. Through structured introspection of the best-performing model, GPT-4o, we show that facial attributes like eyebrow position drive much of its affective reasoning, revealing a high degree of internal consistency in its valence-arousal predictions. These patterns highlight the emergent nature of FM behaviour, but also reveal risks: shortcut learning, bias, and fairness issues, especially in sensitive domains like mental health and education.

Paper Structure

This paper contains 18 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Example facial images from AffectNet [mollahosseini17-AFF] for each of the seven emotion categories used in the teeth annotation. Each column shows one image with visible teeth (top row) and one without (bottom row).
  • Figure 2: The structured prompt was chosen to standardise the model outputs and ensure that the data collected from GPT-4o is both consistent and meaningful. For the remaining VLMs, the prompt included only item 1.
  • Figure 3: Performance of a trained random forest classifier that predicts GPT's categorical emotion labels using only its predicted valence and arousal values
  • Figure 4: Valence and arousal (y-axis) distributions over facial features (x-axis) in GPT-4o.
  • Figure 5: UAR for emotion classification across various VLMs, grouped by teeth visibility. Baselines from ViT-FER across teeth visibility states are shown as reference lines.
  • ...and 2 more figures
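
Figure 5 reports UAR (unweighted average recall) split by teeth visibility. UAR is the mean of the per-class recalls, so each emotion class counts equally regardless of how many samples it has. The sketch below shows one way such a grouped score could be computed; the record layout, group names, and labels are hypothetical toy values, not data or code from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls over the classes that occur in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def uar_by_group(records):
    """Compute UAR separately for each group.

    records: iterable of (group, true_label, predicted_label) tuples.
    """
    grouped = defaultdict(lambda: ([], []))
    for group, truth, pred in records:
        grouped[group][0].append(truth)
        grouped[group][1].append(pred)
    return {
        group: unweighted_average_recall(truths, preds)
        for group, (truths, preds) in grouped.items()
    }


# Hypothetical toy records (assumption: not taken from the paper).
records = [
    ("teeth", "happy", "happy"),
    ("teeth", "happy", "happy"),
    ("teeth", "fear", "happy"),
    ("teeth", "fear", "fear"),
    ("no_teeth", "happy", "neutral"),
    ("no_teeth", "happy", "happy"),
    ("no_teeth", "sad", "sad"),
    ("no_teeth", "sad", "neutral"),
]
scores = uar_by_group(records)
```

In this toy data, the "teeth" group scores 0.75 (recall 1.0 for happy, 0.5 for fear) and the "no_teeth" group scores 0.5, which mirrors the kind of per-group comparison the figure describes.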