Beyond Vision: How Large Language Models Interpret Facial Expressions from Valence-Arousal Values
Vaibhav Mehra, Guy Laban, Hatice Gunes
TL;DR
This work probes whether large language models (LLMs) can infer affective meaning from structured Valence-Arousal (VA) values without processing raw facial imagery. Two experiments on the IIMI and Emotic datasets use VA values extracted with FaceChannel to (i) categorize expressions into emotion labels and (ii) generate semantic descriptions, comparing multiple LLMs. Results show limited success for rigid categorization, especially for complex emotions, but robust performance for free-text affective descriptions, with semantic similarity to human annotations ranging from 0.38 to 0.81 depending on the model and embedding. These results highlight both the potential and the limits of cross-modal, VA-based inference. The findings suggest LLMs can contribute to privacy-conscious affective computing through descriptive reasoning, while underscoring the need for multimodal integration and bias-aware deployment in real-world settings.
Abstract
Large Language Models (LLMs) primarily operate through text-based inputs and outputs, yet human emotion is communicated through both verbal and non-verbal cues, including facial expressions. While Vision-Language Models analyze facial expressions from images, they are resource-intensive and may depend more on linguistic priors than on visual understanding. To address this, this study investigates whether LLMs can infer affective meaning from dimensions of facial expressions, namely Valence and Arousal (VA) values, which are structured numerical representations, rather than from raw visual input. VA values were extracted with FaceChannel from images of facial expressions and provided to LLMs in two tasks: (1) categorizing facial expressions into basic emotions (on the IIMI dataset) and complex emotions (on the Emotic dataset) and (2) generating semantic descriptions of facial expressions (on the Emotic dataset). Results from the categorization task indicate that LLMs struggle to classify VA values into discrete emotion categories, particularly for emotions beyond basic polarities such as happiness and sadness. However, in the semantic description task, LLMs produced textual descriptions that align closely with human-generated interpretations, demonstrating a stronger capacity for free-text affective inference from facial expressions.
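As a rough illustration of the two-step setup described above (VA values passed to an LLM as text, then generated descriptions compared with human annotations), the following minimal sketch may help. It is not the authors' implementation: the prompt wording, the function names `build_va_prompt` and `cosine_similarity`, the assumed VA range of [-1, 1], and the choice of cosine similarity over embedding vectors are all assumptions made for illustration.

```python
import math


def build_va_prompt(valence: float, arousal: float) -> str:
    """Format one (valence, arousal) pair as a text prompt for an LLM.

    Assumption: VA values lie in [-1, 1]; the prompt wording is hypothetical.
    """
    if not (-1.0 <= valence <= 1.0 and -1.0 <= arousal <= 1.0):
        raise ValueError("this sketch assumes VA values in [-1, 1]")
    return (
        f"A facial expression has valence {valence:+.2f} and "
        f"arousal {arousal:+.2f}, each on a scale from -1 to 1. "
        "Describe in one sentence what the person may be feeling."
    )


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Assumption: semantic similarity is scored as the cosine between
    sentence embeddings of the LLM output and the human annotation.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # The prompt would be sent to an LLM; the call itself is omitted here.
    print(build_va_prompt(0.6, -0.2))
    # Toy vectors stand in for real sentence embeddings.
    print(cosine_similarity([0.1, 0.3, 0.5], [0.2, 0.1, 0.6]))
```

In practice the toy vectors would be replaced by embeddings of the generated description and the human annotation, produced by whichever embedding model is under evaluation.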
