REACT 2024: the Second Multiple Appropriate Facial Reaction Generation Challenge
Siyang Song, Micol Spitale, Cheng Luo, Cristina Palmero, German Barquero, Hengde Zhu, Sergio Escalera, Michel Valstar, Tobias Baur, Fabien Ringeval, Elisabeth Andre, Hatice Gunes
TL;DR
REACT 2024 addresses the generation of multiple appropriate facial reactions in response to a speaker in video-conference settings. It defines an offline and an online sub-challenge and evaluates outputs both as facial attribute time-series and as video sequences. Three baselines are provided: Trans-VAE (transformer-based), BeLFusion (diffusion-based), and REGNN (a reversible graph neural network), each modeling the one-to-many mapping from speaker behavior to listener reactions with a different architecture. Evaluation uses a suite of metrics covering appropriateness, diversity, realism, and synchrony (FRCorr, FRDist, FRDiv, FRVar, FRDvs, FRSyn, and FRRea), computed on NoXi and RECOLA data with 25 attributes per frame. The baseline results show that REGNN excels in appropriateness and synchrony in the offline setting, BeLFusion improves diversity and realism, and all baselines outperform random generation, giving a benchmark and a starting point for realistic, diverse facial reaction generation in multimodal human-computer interaction.
Abstract
In dyadic interactions, humans communicate their intentions and state of mind using verbal and non-verbal cues, and multiple different facial reactions might be appropriate in response to a specific speaker behaviour. Developing a machine learning (ML) model that can automatically generate multiple appropriate, diverse, realistic and synchronised human facial reactions from a previously unseen speaker behaviour is therefore a challenging task. Following the successful organisation of the first REACT challenge (REACT 2023), this edition of the challenge (REACT 2024) employs a subset of the data used in the previous challenge, which contains segmented 30-second dyadic interaction clips originally recorded as part of the NoXi and RECOLA datasets. It encourages participants to develop and benchmark ML models that can generate multiple appropriate facial reactions (including facial image sequences and their attributes) given an input conversational partner's stimulus under various dyadic video conference scenarios. This paper presents: (i) the guidelines of the REACT 2024 challenge; (ii) the dataset utilized in the challenge; and (iii) the performance of the baseline systems on the two proposed sub-challenges, Offline Multiple Appropriate Facial Reaction Generation and Online Multiple Appropriate Facial Reaction Generation. The challenge baseline code is publicly available at https://github.com/reactmultimodalchallenge/baseline_react2024.
