NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning

Guoan Wang; Shihao Yang; Jun-en Ding; Hao Zhu; Feng Liu

Abstract

Electroencephalography (EEG) provides a non-invasive window into neural dynamics at high temporal resolution and plays a pivotal role in clinical neuroscience research. Despite this potential, prevailing computational approaches to EEG analysis remain largely confined to task-specific classification objectives or coarse-grained pattern recognition, offering limited support for clinically meaningful interpretation. To address these limitations, we introduce NeuroNarrator, the first generalist EEG-to-text foundation model designed to translate electrophysiological segments into precise clinical narratives. A cornerstone of this framework is the curation of NeuroCorpus-160K, the first harmonized large-scale resource pairing over 160,000 EEG segments with structured, clinically grounded natural-language descriptions. Our architecture first aligns temporal EEG waveforms with spatial topographic maps via a rigorous contrastive objective, establishing spectro-spatially grounded representations. Building on this grounding, we condition a Large Language Model through a state-space-inspired formulation that integrates historical temporal and spectral context to support coherent clinical narrative generation. This approach establishes a principled bridge between continuous signal dynamics and discrete clinical language, enabling interpretable narrative generation that facilitates expert interpretation and supports clinical reporting workflows. Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator's capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware, open-ended clinical interpretation of electrophysiological data.
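The contrastive alignment step mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the exact loss, so the code below assumes a symmetric InfoNCE (CLIP-style) objective over cosine similarities. The function names, the temperature of 0.07, and the toy embeddings are all hypothetical and serve only to show the mechanics.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity_matrix(eeg_embs, topo_embs):
    """Entry [i][j] is the cosine similarity of EEG segment i and topographic map j."""
    eeg = [l2_normalize(v) for v in eeg_embs]
    topo = [l2_normalize(v) for v in topo_embs]
    return [[sum(a * b for a, b in zip(e, t)) for t in topo] for e in eeg]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(eeg_embs, topo_embs, temperature=0.07):
    """Assumed objective: average of EEG-to-map and map-to-EEG InfoNCE terms."""
    sim = cosine_similarity_matrix(eeg_embs, topo_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the reverse direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: three hypothetical EEG embeddings and their matched map embeddings.
eeg = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
topo_matched = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
topo_shuffled = topo_matched[1:] + topo_matched[:1]  # deliberately misaligned pairs

loss_matched = symmetric_info_nce(eeg, topo_matched)
loss_shuffled = symmetric_info_nce(eeg, topo_shuffled)
print(loss_matched < loss_shuffled)
```

Under this assumed objective, correctly paired batches yield a lower loss than misaligned ones, which is the property that pulls matched EEG and topographic-map embeddings together in the shared latent space.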

Paper Structure (29 sections, 7 equations, 8 figures, 5 tables)

This paper contains 29 sections, 7 equations, 8 figures, 5 tables.

Introduction
Related Work
Generalist EEG Representation Learning
EEG-to-Text Generation and Multimodal Alignment
Spectro-Spatial and Temporal Modeling in Neuroscience
Methods
Overview of the NeuroNarrator Framework
Construction of a Clinically Grounded EEG–Text Corpus
Multi-Source EEG Data Harmonization
Unified Signal Preprocessing
Structured Feature Extraction and LLM-Driven Description Refinement
State-Space-Inspired Temporal Context Modeling
Spectro-Spatial Representation Learning
Dual-Stream Spectro-Spatial Encoding
Contrastive Spectro-Spatial Alignment
...and 14 more sections

Figures (8)

Figure 1: Illustrative example of segment-level, clinically grounded EEG interpretation. In contrast to coarse recording-level interpretation, this sample demonstrates the generation of a fine-grained clinical narrative for a 10-second segment. The generated text systematically synthesizes four dimensions of electrophysiological analysis: (i) Event & Clinical Labels, identifying specific morphological abnormalities (e.g., spike-and-slow-wave complexes); (ii) Spatial Energy Distribution, capturing prominent signal and energy variations in specific anatomical regions (e.g., right fronto-temporal); (iii) Frequency-Domain Features, identifying the dominant spectral bands; and (iv) Temporal Context, characterizing the non-stationary evolution of brain states relative to preceding segments.
Figure 2: NeuroNarrator architecture for spectro–spatially grounded and temporally coherent EEG-to-text generation. (a) Dual-stream spectro–spatial grounding encodes each EEG segment using a pretrained EEG encoder operating on multichannel waveforms and a frozen vision encoder processing the corresponding scalp topographic map. Modality-specific features are projected into a shared latent space and aligned via a contrastive objective, enforcing correspondence between spectral dynamics and spatial energy distributions. (b) State-space–inspired generative modeling conditions text generation on both the aligned spectro–spatial embedding of the current segment and a short trajectory of preceding EEG segments, serving as a proxy for latent brain-state evolution. These continuous embeddings are injected as soft prompt tokens, replacing designated placeholder positions in the language-model prompt alongside task instructions, enabling the synthesis of clinically grounded narratives that preserve waveform morphology, dominant frequency structure, spatial localization, and temporal dynamics.
Figure 3: Overview of NeuroCorpus-160K construction. (a) Distribution of the aggregated datasets across major clinical domains. (b) The unified data processing workflow, which transforms raw recordings into clinically grounded narratives via three stages: signal preprocessing, structured feature extraction, and LLM-driven description refinement.
Figure 4: Visualization of the learned spectro-spatial manifold via t-SNE projection (van der Maaten and Hinton, 2008). The plots depict the latent distribution of EEG segments (triangles) and corresponding topographic maps (circles) sampled from the held-out NeuroCorpus-160K evaluation split. Left: In the absence of contrastive alignment, the representations exhibit a distinct modality gap, with temporal and spatial features forming fragmented, disjoint clusters. Right: Following contrastive optimization, the embedding space demonstrates a rigorous topological correspondence; matched EEG and topographic map pairs are tightly co-located within dataset-specific clusters.
Figure 5: Statistical validation of the learned spectro-spatial metric space. Left: Cosine similarity matrix. The pronounced diagonal dominance confirms a rigorous one-to-one correspondence between temporal EEG embeddings and spatial topographic map embeddings, effectively minimizing off-diagonal ambiguity. Right: Distribution of cosine similarity scores for matched versus mismatched pairs.
...and 3 more figures
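The soft-prompt injection described in the Figure 2(b) caption, where continuous embeddings replace designated placeholder positions in the language-model prompt, can be sketched as follows. This is a minimal illustration and not the authors' implementation. The placeholder id, the function name, the toy two-dimensional embeddings, and the ordering (preceding segments first, current segment last) are assumptions.

```python
PLACEHOLDER_ID = -1  # hypothetical token id marking a slot reserved for an EEG embedding

def inject_soft_prompts(token_ids, token_embeddings, segment_embeddings,
                        placeholder_id=PLACEHOLDER_ID):
    """Return a copy of token_embeddings with placeholder rows replaced, in order,
    by the given segment embeddings."""
    positions = [i for i, t in enumerate(token_ids) if t == placeholder_id]
    if len(positions) != len(segment_embeddings):
        raise ValueError("number of placeholders must match number of segment embeddings")
    dim = len(token_embeddings[0])
    if any(len(e) != dim for e in segment_embeddings):
        raise ValueError("segment embeddings must match the token embedding width")
    out = [list(row) for row in token_embeddings]  # copy so the input is not mutated
    for pos, emb in zip(positions, segment_embeddings):
        out[pos] = list(emb)
    return out

# Toy prompt: instruction tokens interleaved with three placeholder slots.
token_ids = [101, PLACEHOLDER_ID, PLACEHOLDER_ID, 102, PLACEHOLDER_ID, 103]
token_embeddings = [[0.0, 0.0] for _ in token_ids]

history = [[0.2, 0.4], [0.3, 0.5]]  # assumed: embeddings of two preceding segments
current = [0.9, 0.1]                # assumed: aligned embedding of the current segment

prompt_embeddings = inject_soft_prompts(token_ids, token_embeddings, history + [current])
print(prompt_embeddings)
```

In a real system the segment embeddings would first be projected to the language model's hidden width; the dimension check above stands in for that requirement.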
