Modeling Face Emotion Perception from Naturalistic Face Viewing: Insights from Fixational Events and Gaze Strategies
Meisam J. Seikavandi, Maria J. Barrett, Paolo Burelli
TL;DR
This work investigates how eye movements during naturalistic face viewing relate to emotion perception, using an instructionless face emotion recognition (FER) paradigm with two processes: free viewing and grounded FER. It combines fixational, microsaccadic, and pupillary features with deep-face embeddings and sequential models (including a bidirectional LSTM) to predict emotion perception performance and dwell-time dynamics across three tasks, using a randomization-based modification of a classical instructionless FER task. The study shows that early gaze patterns can predict later FER success, identifies emotion- and region-specific gaze differences, and demonstrates a modeling pipeline in which temporal and spatiotemporal gaze features improve prediction, though cross-user generalization remains challenging. By providing a standardized GEI-attentive framework and a dataset/tool for cross-dataset comparability, the work advances ecologically valid emotion recognition and informs applications in psychology, HCI, and affective computing.
Abstract
Face Emotion Recognition (FER) is essential for social interaction and for understanding others' mental states. Eye-tracking studies of FER have yielded insights into the underlying cognitive processes. In this study, we used an instructionless paradigm to collect eye movement data from 21 participants, examining two FER processes: free viewing and grounded FER. We analyzed fixational, pupillary, and microsaccadic events in the eye movements and established their correlation with emotion perception and with performance in the grounded task. By identifying regions of interest on the face, we explored how eye-gaze strategies affect face processing and how they relate to emotions and to performance in emotion perception. During free viewing, participants displayed distinct attention patterns for different emotions. In grounded tasks, where emotions were interpreted based on words, we assessed performance and contextual understanding. Notably, gaze patterns during free viewing predicted success in grounded FER tasks, underscoring the significance of initial gaze behavior. We also used features from pre-trained deep-learning models for face recognition to make attention analysis during free viewing more scalable and more comparable across datasets and populations. This method enabled the prediction and modeling of individual emotion perception performance from minimal observations. Our findings advance the understanding of the link between eye movements and emotion perception, with implications for psychology, human-computer interaction, and affective computing, and pave the way for more precise emotion recognition systems.
