EgoBrain: Synergizing Minds and Eyes For Human Action Understanding

Nie Lin, Yansen Wang, Dongqi Han, Weibang Jiang, Jingyuan Li, Ryosuke Furuta, Yoichi Sato, Dongsheng Li

TL;DR

EgoBrain addresses the challenge of understanding human actions by uniting egocentric vision with brain activity. It introduces a large-scale, synchronized EEG–video dataset and Brain-TIM, a Transformer-based framework that fuses the two modalities using Time Interval MLP temporal embeddings. Empirical results show consistent gains from multimodal fusion over unimodal baselines in cross-subject and cross-scene settings, supporting the view that neural signals and visual cues are complementary. The work enables open, cross-modal research in brain-computer interfaces and real-world action understanding, with standardized preprocessing and shared protocols to foster reproducibility.

Abstract

The integration of brain-computer interfaces (BCIs), in particular electroencephalography (EEG), with artificial intelligence (AI) has shown tremendous promise in decoding human cognition and behavior from neural signals. In particular, the rise of multimodal AI models has opened up possibilities that were previously out of reach. Here, we present EgoBrain, the world's first large-scale, temporally aligned multimodal dataset that synchronizes egocentric vision and human brain EEG over extended periods of time, establishing a new paradigm for human-centered behavior analysis. This dataset comprises 61 hours of synchronized 32-channel EEG recordings and first-person video from 40 participants engaged in 29 categories of daily activities. We then developed a multimodal learning framework to fuse EEG and vision for action understanding, validated across both cross-subject and cross-environment challenges, achieving an action recognition accuracy of 66.70%. EgoBrain paves the way for a unified framework for brain-computer interfaces with multiple modalities. All data, tools, and acquisition protocols are openly shared to foster open science in cognitive computing.

Paper Structure

This paper contains 26 sections, 1 equation, 12 figures, 2 tables.

Figures (12)

  • Figure 1: The EgoBrain dataset and experimental setup. a (Left) Acoustic isolation chamber with adjustable lighting and modular workstation containing standardized interaction objects. (Right) Portable apparatus configuration showing helmet-mounted GoPro camera and Emotive FLEX 2 Gel EEG headset. b High-fidelity egocentric video recording hand-object interactions and 32-channel EEG signals. c Subject performing the "Read book" action following on-screen textual prompts. d From command display ("Play Cube") to object interaction and completion confirmation.
  • Figure 2: The EgoBrain statistics. The total duration per category is presented, highlighting the longest duration (Play(II) puzzle: 4.29 hours) and the shortest duration (Drink Bitter Juice: 0.49 hours).
  • Figure 3: The overall architecture of Brain-TIM. The model processes synchronized visual and EEG signals using modality-specific encoders, followed by embedding layers to obtain token sequences. The shared temporal axis is concurrently encoded by the TIM module. A modality-aware $\texttt{CLS}$ token is appended to the sequence to capture global semantics. The resulting tokens are fed into a Transformer encoder for downstream action classification.
  • Figure 4: Confusion matrix for verb classification: unimodal vs. multimodal. a Visual-only classification. b Visual-EEG fusion. EEG improves "Play(I)" accuracy (0.46→0.64) and boosts "Drink" (0.87→0.94), showing its essential role when visual information is ambiguous.
  • Figure 5: Success and Failure Cases for Unimodal (Visual) and Multimodal (Visual + EEG) Models. (a) Multimodal model correctly recognizes actions that the visual-only model misses, aided by EEG. (b) EEG causes misclassification, possibly due to overlapping cognitive strategies.
  • ...and 7 more figures
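
The token-assembly step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`make_mlp`, `build_sequence`), the embedding width, the untrained random weights, the constant modality vectors, and the choice to give the `CLS` token the embedding of a query interval are all hypothetical. The Transformer encoder and the modality-specific encoders are omitted; features are assumed to be precomputed.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Return a tiny two-layer MLP with random (untrained) weights."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 1 / math.sqrt(in_dim)) for _ in range(in_dim)]
          for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0, 1 / math.sqrt(hidden_dim)) for _ in range(hidden_dim)]
          for _ in range(out_dim)]

    def forward(x):
        # Linear layer followed by ReLU, then a second linear layer.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
        return [sum(w * v for w, v in zip(row, hidden)) for row in w2]

    return forward


def build_sequence(visual_tokens, eeg_tokens, clip_len, query, dim=8):
    """Assemble the token sequence that would be fed to a Transformer encoder.

    visual_tokens, eeg_tokens: lists of (start_s, end_s, feature_vector),
        where each feature_vector has length `dim` (assumed precomputed).
    clip_len: duration of the input window in seconds, used to normalise times.
    query: (start_s, end_s) interval of the action to classify (assumption).
    """
    # One interval MLP shared by both modalities, so tokens that cover the
    # same time span receive the same temporal embedding.
    interval_mlp = make_mlp(2, 16, dim, seed=1)
    # Stand-ins for learned modality embeddings (hypothetical constants).
    modality_emb = {"visual": [0.1] * dim, "eeg": [-0.1] * dim}

    def embed_interval(start, end):
        return interval_mlp([start / clip_len, end / clip_len])

    sequence = []
    for name, tokens in (("visual", visual_tokens), ("eeg", eeg_tokens)):
        for start, end, feat in tokens:
            t_emb = embed_interval(start, end)
            m_emb = modality_emb[name]
            fused = [f + t + m for f, t, m in zip(feat, t_emb, m_emb)]
            sequence.append((name, fused))

    # CLS token appended last; here it carries the query-interval embedding.
    sequence.append(("cls", embed_interval(*query)))
    return sequence
```

A call such as `build_sequence(vis, eeg, clip_len=2.0, query=(0.0, 2.0))` returns one entry per visual token, one per EEG token, and a final `cls` entry. Sharing a single interval MLP across modalities is the design point the sketch tries to convey: alignment between EEG and video comes from the common time axis rather than from matching token counts.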