Variational Encoder–Multi-Decoder (VE-MD) for Privacy-by-functional-design (Group) Emotion Recognition

Anderson Augusma, Dominique Vaufreydaz, Frédérique Letué

Abstract

Group Emotion Recognition (GER) aims to infer collective affect in social environments such as classrooms, crowds, and public events. Many existing approaches rely on explicit individual-level processing, including cropped faces, person tracking, or per-person feature extraction, which makes the analysis pipeline person-centric and raises privacy concerns in deployment scenarios where only group-level understanding is needed. This research proposes VE-MD, a Variational Encoder–Multi-Decoder framework for group emotion recognition under a privacy-aware functional design. Rather than providing formal anonymization or cryptographic privacy guarantees, VE-MD is designed to avoid explicit individual monitoring by constraining the model to predict only aggregate group-level affect, without identity recognition or per-person emotion outputs. VE-MD learns a shared latent representation jointly optimized for emotion classification and internal prediction of body and facial structural representations. Two structural decoding strategies are investigated: a transformer-based PersonQuery decoder and a dense Heatmap decoder that naturally accommodates variable group sizes. Experiments on six in-the-wild datasets, comprising two GER and four Individual Emotion Recognition (IER) benchmarks, show that structural supervision consistently improves representation learning. More importantly, the results reveal a clear distinction between GER and IER: optimizing the latent space alone is often insufficient for GER because it tends to attenuate interaction-related cues, whereas preserving explicit structural outputs improves collective affect inference. In contrast, projected structural representations seem to act as an effective denoising bottleneck for IER. VE-MD achieves state-of-the-art performance on GAF-3.0 (up to 90.06%) and VGAF (82.25% with audio-visual fusion). These results show that preserving interaction-related structural information is particularly beneficial for group-level affect modeling without relying on prior individual feature extraction. On the IER datasets, using multimodal fusion with the audio modality, VE-MD outperforms the state of the art on SAMSEMO (77.9%, with the text modality added) while achieving competitive performance on MER-MULTI (63.8%), DFEW (70.7%), and EngageNet (69.0%).
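As a rough illustration of the pipeline the abstract describes, here is a minimal PyTorch sketch of the VE-MD forward pass: a variational encoder producing the two latent spaces $Z_1$ and $Z_2$, structural decoders driven by $Z_2$, and an emotion head consuming both latents. Every name and dimension below (VariationalEncoder, VEMD, Z_DIM, the toy convolutional backbone, the linear stand-ins for the structural decoders) is an assumption made for this sketch, not the authors' implementation.

# Hedged sketch of the VE-MD forward pass; all shapes and names are illustrative.
import torch
import torch.nn as nn

Z_DIM, N_CLASSES = 128, 3  # hypothetical latent size and number of emotion classes

class VariationalEncoder(nn.Module):
    """Maps an input frame to two latent spaces: Z1 (emotion-only) and
    Z2 (jointly optimized for emotion and structural prediction)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for the real visual backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mu1 = nn.Linear(32, Z_DIM); self.logvar1 = nn.Linear(32, Z_DIM)
        self.mu2 = nn.Linear(32, Z_DIM); self.logvar2 = nn.Linear(32, Z_DIM)

    @staticmethod
    def reparameterize(mu, logvar):
        # Standard VAE reparameterization: z = mu + sigma * eps
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def forward(self, x):
        h = self.backbone(x)
        z1 = self.reparameterize(self.mu1(h), self.logvar1(h))
        z2 = self.reparameterize(self.mu2(h), self.logvar2(h))
        return z1, z2

class VEMD(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = VariationalEncoder()
        # Structural decoders consume Z2 only; these linear layers are
        # placeholders for the PersonQuery or Heatmap decoders.
        self.body_decoder = nn.Linear(Z_DIM, 18 * 2)   # e.g. 18 COCO-style limbs
        self.face_decoder = nn.Linear(Z_DIM, 20 * 2)   # e.g. 20 custom face limbs
        self.emotion_head = nn.Linear(2 * Z_DIM, N_CLASSES)

    def forward(self, x):
        z1, z2 = self.encoder(x)
        body, face = self.body_decoder(z2), self.face_decoder(z2)
        # The emotion decoder uses Z1 and Z2; projected structural outputs
        # (the optional "SR inputs") could also be concatenated here.
        logits = self.emotion_head(torch.cat([z1, z2], dim=-1))
        return logits, body, face

logits, body, face = VEMD()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 3]) -- aggregate group-level affect only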

Paper Structure

This paper contains 52 sections, 5 equations, 8 figures, and 22 tables.

Figures (8)

  • Figure 1: Overview of the proposed VE-MD architecture. Left: input data, consisting of video frames or a single image. Middle: the Variational Encoder (VE, green box), which learns two latent spaces: $Z_1$ for emotion recognition and $Z_2$ for joint optimization of emotion recognition and person structural representation. Right: the multi-decoder head, where the Emotion Decoder uses $Z_1$ and $Z_2$, optionally complemented by structural outputs from the body and face decoders (SR inputs). Optional modules are outlined with a dashed line.
  • Figure 2: PersonQuery decoder. Left: the latent space is processed by an auxiliary convolutional module that produces three multi-scale feature tensors, $\mathbf{F}_1$, $\mathbf{F}_2$, and $\mathbf{F}_3$, which are flattened and fed to the transformer encoder. Right: the transformer decoder receives target queries together with the encoded features and predicts the structural representation and the adjacency matrix through a multi-layer perceptron head and a fully connected head, respectively.
  • Figure 3: Heatmap decoder. Top: the VE latent-space input, followed by a custom UNet-upsample network; a limbs decoder then predicts the limb heatmaps. The same process is applied to the structural representation of faces (a hedged code sketch follows this list).
  • Figure 4: Overview of the datasets used in this study. GAF-3.0 and VGAF (left) illustrate the Group Emotion Recognition (GER) datasets, followed by examples from the IER datasets: DFEW, SAMSEMO, MER-MULTI, and EngageNet.
  • Figure 5: Example of automatic data annotation for structural representation. The first row shows body annotations obtained with ViTPose, using 18 limb connections (COCO-style). The second row shows face annotations obtained with FaceAlignment, using 20 custom limb connections.
  • ...and 3 more figures
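The dense Heatmap decoder of Figure 3 can be made concrete with a short, hedged sketch: the VE latent vector is projected onto a small spatial grid and upsampled into one dense heatmap channel per limb connection, which is what lets a single decoder cover any number of people in the scene. The plain transposed-convolution stack below stands in for the paper's custom UNet-upsample network, and every dimension (z_dim=128, a 7x7 grid, 18 body limbs, 56x56 output) is an assumption of this sketch.

# Hedged stand-in for the Heatmap decoder of Figure 3; not the authors' network.
import torch
import torch.nn as nn

class HeatmapDecoder(nn.Module):
    def __init__(self, z_dim=128, n_limbs=18, grid=7):
        super().__init__()
        self.grid = grid
        self.project = nn.Linear(z_dim, 64 * grid * grid)
        self.upsample = nn.Sequential(  # replaces the custom UNet-upsample network
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 7 -> 14
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 14 -> 28
            nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1), nn.ReLU())  # 28 -> 56
        # Limbs decoder: one dense heatmap channel per limb connection, so the
        # output accommodates variable group sizes without per-person branches.
        self.limbs_head = nn.Conv2d(16, n_limbs, 1)

    def forward(self, z):
        h = self.project(z).view(-1, 64, self.grid, self.grid)
        return self.limbs_head(self.upsample(h))

hm = HeatmapDecoder()(torch.randn(2, 128))
print(hm.shape)  # torch.Size([2, 18, 56, 56])

A face variant would reuse the same structure with 20 output channels for the custom face limb connections, while the PersonQuery decoder of Figure 2 would instead pass flattened multi-scale features through a standard transformer encoder-decoder (e.g. torch.nn.Transformer) with learned target queries.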