ERIT Lightweight Multimodal Dataset for Elderly Emotion Recognition and Multimodal Fusion Evaluation

Rita Frieske, Bertram E. Shi

TL;DR

ERIT addresses two gaps: the underrepresentation of older adults in emotion recognition data and the need for lightweight multimodal fusion benchmarks. It provides a text-plus-image dataset with frame-level labels for seven emotions, derived from ElderReact videos. The authors assemble and validate the data using Whisper ASR, DeepFace-based frame selection, and ground-truth ElderReact labels to support reliable fusion benchmarking. They evaluate multiple large language models and fusion strategies and report that multimodal fusion outperforms single-modality approaches for elderly emotion recognition. The dataset is publicly available and validated, which makes it a useful benchmark for researchers and practitioners in healthcare, assistive technologies, and multimodal ML.

Abstract

ERIT is a novel multimodal dataset designed to facilitate research in lightweight multimodal fusion. It contains text and image data collected from videos of elderly individuals reacting to various situations, as well as seven emotion labels for each data sample. Because it uses labeled images of elderly people reacting emotionally, it also facilitates research on emotion recognition for an age group that is underrepresented in machine learning for visual emotion recognition. The dataset is validated through comprehensive experiments that indicate its value for neural multimodal fusion research.

Paper Structure

This paper contains 6 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Examples of frames labeled with different emotions from ERIT.
  • Figure 2: Percentage of different emotion labels among ERIT test, dev, and train splits.
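
The per-split label percentages that Figure 2 reports can be computed with a few lines of Python. The following is a minimal sketch, not the authors' code: the function name `label_percentages`, the split dictionary, and the label strings are hypothetical placeholders, and the toy values are made up rather than taken from ERIT.

```python
from collections import Counter


def label_percentages(labels):
    """Return {label: percentage of samples} for one split.

    `labels` is any iterable of hashable label values, one per sample.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        # An empty split has no distribution to report.
        return {}
    return {label: 100.0 * n / total for label, n in sorted(counts.items())}


# Toy, made-up labels for illustration only; these are NOT ERIT data,
# and the label names are placeholders rather than the dataset's label set.
toy_splits = {
    "train": ["happiness", "happiness", "sadness", "anger"],
    "dev": ["happiness", "sadness"],
    "test": ["anger", "anger", "happiness", "sadness"],
}

for split, labels in toy_splits.items():
    print(split, label_percentages(labels))
```

Comparing these percentages across train, dev, and test is a quick way to check whether the splits have similar class balance before benchmarking fusion models on them.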