DEAP DIVE: Dataset Investigation with Vision transformers for EEG evaluation
Annemarie Hoffsommer, Helen Schneider, Svetlana Pavlitska, J. Marius Zöllner
TL;DR
This work demonstrates that emotion recognition from EEG can be achieved effectively using only a subset of channels, by converting channel signals into scaleograms via the Continuous Wavelet Transform and classifying them with a Vision Transformer. The study shows that 12-channel configurations, particularly Emotiv subsets, reach around 91.5% accuracy, approaching the state of the art with substantially fewer inputs than traditional 32-channel setups, and that even single-channel eye-movement signals can yield meaningful predictive performance. It also provides an initial baseline for regression on DEAP using EEG data, with a reported RMSE of roughly 0.57–0.98 across configurations, and discusses labeling scheme effects (VAQ vs. SAM) and the interpretability challenges of channel contributions. These findings support the feasibility of portable, low-cost EEG systems for affective computing and outline future work on explainable AI and cross-device generalization.
Abstract
Accurately predicting emotions from brain signals could advance mental health care, human-computer interaction, and affective computing. Emotion prediction from neural signals offers a promising alternative to traditional methods, such as self-assessment and facial expression analysis, which can be subjective or ambiguous. Measuring brain activity via electroencephalogram (EEG) provides a more direct and unbiased data source. However, conducting a full EEG is a complex, resource-intensive process, which has led to the rise of low-cost EEG devices with simplified measurement capabilities. This work examines how subsets of EEG channels from the DEAP dataset can be used for sufficiently accurate emotion prediction with low-cost EEG devices, rather than fully equipped EEG measurements. Using the Continuous Wavelet Transform to convert EEG data into scaleograms, we trained a vision transformer (ViT) model for emotion classification. The model achieved 91.57% accuracy in predicting four quadrants (high/low arousal and valence) with only 12 measuring points (also referred to as channels). Our work shows that a substantially reduced set of input channels still yields high accuracy compared to the state-of-the-art result of 96.9% with 32 channels. Training scripts to reproduce our results can be found here: https://gitlab.kit.edu/kit/aifb/ATKS/public/AutoSMiLeS/DEAP-DIVE.
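To illustrate how a single EEG channel can be turned into a scaleogram, the following is a minimal sketch and not the authors' implementation. It assumes a complex Morlet mother wavelet with centre parameter w0 = 6, a sampling rate of 128 Hz, and a synthetic 10 Hz sine wave standing in for a real channel; none of these choices is stated in the abstract. The function names `morlet`, `cwt_scaleogram`, and `scale_for_frequency` are hypothetical helpers written in pure Python for clarity. In practice, a wavelet library would likely be used instead.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet at dimensionless time t (correction term omitted)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt_scaleogram(signal, scales, fs, w0=6.0):
    """Return |CWT| as a list of rows: one row per scale, one column per sample."""
    n = len(signal)
    dt = 1.0 / fs
    rows = []
    for s in scales:
        # The Gaussian envelope is negligible beyond 4 standard deviations.
        half = int(4.0 * s * fs)
        row = []
        for b in range(n):
            acc = 0j
            for k in range(max(0, b - half), min(n, b + half + 1)):
                acc += signal[k] * morlet((k - b) * dt / s, w0).conjugate()
            # Discretised integral with the usual 1/sqrt(scale) normalisation.
            row.append(abs(acc) * dt / math.sqrt(s))
        rows.append(row)
    return rows


def scale_for_frequency(freq_hz, w0=6.0):
    """Approximate Morlet scale (in seconds) whose centre frequency is freq_hz."""
    return w0 / (2.0 * math.pi * freq_hz)


# Assumed setup: 2 s of a synthetic 10 Hz oscillation sampled at 128 Hz.
fs = 128.0
signal = [math.sin(2.0 * math.pi * 10.0 * i / fs) for i in range(256)]
freqs = [4.0, 10.0, 30.0]
scaleogram = cwt_scaleogram(signal, [scale_for_frequency(f) for f in freqs], fs)

# Mean magnitude per row shows which analysed frequency dominates the signal.
mean_magnitude = [sum(row) / len(row) for row in scaleogram]
strongest = freqs[mean_magnitude.index(max(mean_magnitude))]
print(strongest)
```

Each row of `scaleogram` corresponds to one analysed frequency and each column to one time sample, so the result can be rendered as a 2D image. For the synthetic input, the 10 Hz row carries the largest mean magnitude. In a pipeline like the one described here, such an image per channel would then be passed to the vision transformer.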
