NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning

Guoan Wang, Shihao Yang, Jun-en Ding, Hao Zhu, Feng Liu

Abstract

Electroencephalography (EEG) provides a non-invasive window into neural dynamics at high temporal resolution and plays a pivotal role in clinical neuroscience research. Despite this potential, prevailing computational approaches to EEG analysis remain largely confined to task-specific classification objectives or coarse-grained pattern recognition, offering limited support for clinically meaningful interpretation. To address these limitations, we introduce NeuroNarrator, the first generalist EEG-to-text foundation model designed to translate electrophysiological segments into precise clinical narratives. A cornerstone of this framework is the curation of NeuroCorpus-160K, the first harmonized large-scale resource pairing over 160,000 EEG segments with structured, clinically grounded natural-language descriptions. Our architecture first aligns temporal EEG waveforms with spatial topographic maps via a rigorous contrastive objective, establishing spectro-spatially grounded representations. Building on this grounding, we condition a Large Language Model through a state-space-inspired formulation that integrates historical temporal and spectral context to support coherent clinical narrative generation. This approach establishes a principled bridge between continuous signal dynamics and discrete clinical language, enabling interpretable narrative generation that facilitates expert interpretation and supports clinical reporting workflows. Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator's capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware, open-ended clinical interpretation of electrophysiological data.

Paper Structure (29 sections, 7 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Illustrative example of segment-level, clinically grounded EEG interpretation. In contrast to coarse recording-level interpretation, this sample demonstrates the generation of a fine-grained clinical narrative for a 10-second segment. The generated text systematically synthesizes four dimensions of electrophysiological analysis: (i) Event & Clinical Labels, identifying specific morphological abnormalities (e.g., spike-and-slow-wave complexes); (ii) Spatial Energy Distribution, capturing prominent signal and energy variations in specific anatomical regions (e.g., right fronto-temporal); (iii) Frequency-Domain Features, identifying the dominant spectral bands; and (iv) Temporal Context, characterizing the non-stationary evolution of brain states relative to preceding segments.
  • Figure 2: NeuroNarrator architecture for spectro-spatially grounded and temporally coherent EEG-to-text generation. (a) Dual-stream spectro-spatial grounding encodes each EEG segment using a pretrained EEG encoder operating on multichannel waveforms and a frozen vision encoder processing the corresponding scalp topographic map. Modality-specific features are projected into a shared latent space and aligned via a contrastive objective, enforcing correspondence between spectral dynamics and spatial energy distributions. (b) State-space-inspired generative modeling conditions text generation on both the aligned spectro-spatial embedding of the current segment and a short trajectory of preceding EEG segments, serving as a proxy for latent brain-state evolution. These continuous embeddings are injected as soft prompt tokens, replacing designated placeholder positions in the language-model prompt alongside task instructions, enabling the synthesis of clinically grounded narratives that preserve waveform morphology, dominant frequency structure, spatial localization, and temporal dynamics.
  • Figure 3: Overview of NeuroCorpus-160K construction. (a) Distribution of the aggregated datasets across major clinical domains. (b) The unified data processing workflow, which transforms raw recordings into clinically grounded narratives via three stages: signal preprocessing, structured feature extraction, and LLM-driven description refinement.
  • Figure 4: Visualization of the learned spectro-spatial manifold via t-SNE projection (van der Maaten and Hinton, 2008). The plots depict the latent distribution of EEG segments (triangles) and corresponding topographic maps (circles) sampled from the held-out NeuroCorpus-160K evaluation split. Left: In the absence of contrastive alignment, the representations exhibit a distinct modality gap, with temporal and spatial features forming fragmented, disjoint clusters. Right: Following contrastive optimization, the embedding space demonstrates a rigorous topological correspondence; matched EEG and topographic map pairs are tightly co-located within dataset-specific clusters.
  • Figure 5: Statistical validation of the learned spectro-spatial metric space. Left: Cosine similarity matrix. The pronounced diagonal dominance confirms a rigorous one-to-one correspondence between temporal EEG embeddings and spatial topographic map embeddings, effectively minimizing off-diagonal ambiguity. Right: Distribution of cosine similarity scores for matched versus mismatched pairs.
  • ...and 3 more figures
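
The captions for Figures 2(a) and 5 describe aligning EEG-segment embeddings with topographic-map embeddings through a contrastive objective and then checking the alignment with a cosine similarity matrix. The exact loss is not stated in the text above, so the following is only a minimal sketch that assumes a symmetric InfoNCE (CLIP-style) objective. The function names, the temperature value of 0.07, and the three-dimensional toy embeddings are all illustrative assumptions, not details of NeuroNarrator.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(eeg_embs, topo_embs):
    """Entry [i][j] is the cosine similarity between EEG i and topographic map j."""
    eeg_n = [l2_normalize(v) for v in eeg_embs]
    topo_n = [l2_normalize(v) for v in topo_embs]
    return [[sum(a * b for a, b in zip(e, t)) for t in topo_n] for e in eeg_n]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(eeg_embs, topo_embs, temperature=0.07):
    """Assumed CLIP-style loss: average of EEG-to-map and map-to-EEG cross-entropy."""
    sim = cosine_similarity_matrix(eeg_embs, topo_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the map-to-EEG direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy batch (hypothetical values): row i of each list forms a matched pair.
eeg_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_topo = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Rotating the maps by one position breaks every pairing.
shuffled_topo = [aligned_topo[1], aligned_topo[2], aligned_topo[0]]

loss_aligned = symmetric_info_nce(eeg_emb, aligned_topo)
loss_shuffled = symmetric_info_nce(eeg_emb, shuffled_topo)
print(f"aligned loss:  {loss_aligned:.4f}")
print(f"shuffled loss: {loss_shuffled:.4f}")
```

In this toy setting the matched batch produces a diagonally dominant similarity matrix and a lower loss than the shuffled batch, which is the qualitative behaviour the Figure 5 caption attributes to the learned embedding space.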