NeuroLex: A Lightweight Domain Language Model for EEG Report Understanding and Generation

Kang Yin, Hye-Bin Shin

TL;DR

The paper tackles the mismatch between general language models and EEG-specific reporting language by introducing NeuroLex, a lightweight EEG-domain language model trained with domain-adaptive pretraining and supervised fine-tuning on EEG reports. It demonstrates improved terminology coverage, lower perplexity, and stronger performance on information extraction and summarization, with data-efficient learning and robustness to negation and factual errors. The approach uses a two-stage pipeline (DAPT and SFT) on the HEEDB corpus and four task objectives, enabling both standalone language understanding and multimodal EEG–text decoding. The work advances interpretable, language-grounded EEG–BCI systems by providing an EEG-aware linguistic backbone tailored to clinical narratives.

Abstract

Clinical electroencephalogram (EEG) reports encode domain-specific linguistic conventions that general-purpose language models (LMs) fail to capture. We introduce NeuroLex, a lightweight domain-adaptive language model trained purely on EEG report text from the Harvard Electroencephalography Database. Unlike existing biomedical LMs, NeuroLex is tailored to the linguistic and diagnostic characteristics of EEG reporting, enabling it to serve as both an independent textual model and a decoder backbone for multimodal EEG-language systems. Using span-corruption pretraining and instruction-style fine-tuning on report polishing, paragraph summarization, and terminology question answering, NeuroLex learns the syntax and reasoning patterns characteristic of EEG interpretation. Comprehensive evaluations show that it achieves lower perplexity, higher extraction and summarization accuracy, better label efficiency, and improved robustness to negation and factual hallucination compared with general models of the same scale. With an EEG-aware linguistic backbone, NeuroLex bridges biomedical text modeling and brain-computer interface applications, offering a foundation for interpretable and language-driven neural decoding.

Paper Structure

This paper contains 17 sections, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Overview of the NeuroLex training pipeline with a two-stage training process: DAPT to learn EEG-specific linguistic structure, and SFT to adapt the model toward practical text understanding and generation tasks.
  • Figure 2: A comparison of traditional feature-based EEG report generation and our proposed end-to-end description generation with model understanding.