Learning Interpretable Representations Leads to Semantically Faithful EEG-to-Text Generation

Xiaozhao Liu, Dinggang Shen, Xihui Liu

TL;DR

This work tackles the reliability of EEG-to-text decoding by addressing posterior collapse, reframing decoding as semantic summarization, and introducing the Generative Language Inspection Model (GLIM). GLIM combines a domain-adaptive EEG encoder, a frozen encoder-decoder language model, and a cross-modal querying aligner to align EEG representations with the LM latent space, use multiple paraphrased targets, and regularize with a cross-modal contrastive objective. The approach yields fluent, EEG-grounded sentences without teacher forcing and enables robust semantic evaluation via EEG-text retrieval and zero-shot classification across sentiment, relation types, and topics, demonstrating strong generalization across heterogeneous domains on the ZuCo dataset. This work lays groundwork for reliable, scalable benchmarking in generative brain decoding and points toward practical non-invasive language BCI systems through improved semantic grounding. The key contribution is a modular, interpretable framework that mitigates posterior collapse and emphasizes high-level semantic alignment over surface-level lexical reconstruction, supported by comprehensive semantic evaluations beyond traditional text similarity metrics.
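
The summary above mentions a cross-modal contrastive objective without giving its exact form. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE loss over a batch of paired EEG and text embeddings; the function names, the temperature value of 0.07, and the plain-list inputs are hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(eeg_embs, text_embs, temperature=0.07):
    """Assumed symmetric InfoNCE: pair i of EEG and text is the positive,
    every other pairing in the batch is a negative."""
    eeg = [l2_normalize(v) for v in eeg_embs]
    txt = [l2_normalize(v) for v in text_embs]
    # Cosine similarities scaled by the (hypothetical) temperature.
    logits = [[dot(e, t) / temperature for t in txt] for e in eeg]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the EEG-to-text and text-to-EEG directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

Under this assumed form, a batch whose EEG and text embeddings are correctly paired yields a lower loss than the same batch with the text embeddings shuffled, which is the pressure that pulls matching EEG and text representations together.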

Abstract

Pretrained generative models have opened new frontiers in brain decoding by enabling the synthesis of realistic texts and images from non-invasive brain recordings. However, the reliability of such outputs remains questionable: it is unclear whether they truly reflect semantic activation in the brain or are merely hallucinated by the powerful generative models. In this paper, we focus on EEG-to-text decoding and address its hallucination issue through the lens of posterior collapse. Acknowledging the underlying mismatch in information capacity between EEG and text, we reframe the decoding task as semantic summarization of core meanings rather than the verbatim reconstruction of stimulus texts pursued in previous work. To this end, we propose the Generative Language Inspection Model (GLIM), which emphasizes learning informative and interpretable EEG representations to improve semantic grounding under heterogeneous and small-scale data conditions. Experiments on the public ZuCo dataset demonstrate that GLIM consistently generates fluent, EEG-grounded sentences without teacher forcing. Moreover, it supports more robust evaluation beyond text similarity, through EEG-text retrieval and zero-shot semantic classification across sentiment categories, relation types, and corpus topics. Together, our architecture and evaluation protocols lay the foundation for reliable and scalable benchmarking in generative brain decoding.
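
The retrieval and zero-shot classification evaluations mentioned in the abstract can be pictured as nearest-neighbour lookups in a shared embedding space. The sketch below is a minimal illustration under that assumption: it uses cosine similarity, toy three-dimensional vectors, and hypothetical sentiment label names, none of which are specified in the text above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_classify(eeg_emb, label_embs):
    """Return the label whose text embedding is closest to the EEG embedding."""
    return max(label_embs, key=lambda name: cosine(eeg_emb, label_embs[name]))

def retrieve(eeg_emb, candidate_embs, k=1):
    """Return indices of the k candidate text embeddings most similar to the EEG embedding."""
    ranked = sorted(
        range(len(candidate_embs)),
        key=lambda i: cosine(eeg_emb, candidate_embs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy placeholders: in practice the label and sentence embeddings would come
# from a text encoder and the EEG embedding from the trained EEG pathway.
labels = {
    "positive": [1.0, 0.0, 0.0],
    "neutral": [0.0, 1.0, 0.0],
    "negative": [0.0, 0.0, 1.0],
}
eeg_embedding = [0.9, 0.2, 0.1]
predicted = zero_shot_classify(eeg_embedding, labels)
top_two = retrieve(eeg_embedding, [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], k=2)
```

No classifier head is trained in this setup: a new label set only requires new text embeddings, which is what makes the classification "zero-shot".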

Paper Structure

This paper contains 56 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of GLIM. (A) A typical experimental session in the natural reading dataset ZuCo (Hollenstein et al., 2018) involves a task-specific instruction followed by sentence stimulus blocks and comprehension queries. Participants read at their own speed, and the simultaneously recorded EEG signals are segmented to align with each sentence, forming the EEG-text pairs used in downstream decoding studies. We consider that several factors, including task-specific goals, uneven attention, and the limited signal-to-noise ratio (SNR) of EEG, can collectively introduce a non-negligible information mismatch between stimulus texts and EEG signals, while the small data scale and domain heterogeneity further challenge the convergence and generalization of data-driven models. (B) GLIM acknowledges these practical constraints and reframes EEG-to-text decoding as a summarization task, targeting semantically faithful rather than syntactically mimetic sentence generation. It focuses on the effectiveness and interpretability of EEG decoding on heterogeneous data, which it approaches by learning, end to end, informative EEG representations that are well aligned with the high-level representations of a fully frozen pretrained language model.
  • Figure 2: Architecture and training objective of GLIM. It consists of three modules: a domain-adaptive EEG encoder, a pretrained encoder-decoder language model (LM), and a cross-modal querying aligner. We train the EEG encoder and the querying aligner to align EEG representations with the latent space of the frozen LM. There are two forms of EEG representations: (1) a token-level sequence representation $Z_i$, used to generate sentences by conditioning the LM decoder; and (2) a global embedding $e_i^x$, enabling EEG-text retrieval and zero-shot semantic classification.
  • Figure 3: Representative examples of generated texts. The three groups correspond to NR-SST, NR-Wiki, and TSR-Wiki, each showing the raw stimulus and two generated texts from different subjects. We observe: (1) corpus distinctions are regularly captured (movie reviews in SST vs. personal bios in Wiki); (2) relation types are expressed diversely, especially in the TSR group (e.g., the "education" label is paraphrased as "educated" and "studied"); (3) hallucinations mainly involve contradictory logic and irrelevant content; and (4) repetitive sentence patterns appear but differ across corpus topics ("The movie..." vs. "He was...").
  • Figure 4: (A) Average generation scores per variant type. Light bars denote average BLEU-1 and ROUGE-1 scores under noise input tests ($\mathcal{N}_{in}$); dark bars show absolute improvements over the averaged $\mathcal{N}_{in}$ scores (i.e., $\textit{Score}-\textit{Score}_{{\mathcal{N}}_{\textit{in}}}$). (a) Six LLM-generated variant types. (b) Two types generated by the integrated LM. (c) Baseline references calculated with raw stimulus texts or with all 8 variants (i.e., our main results). (B) Heatmap of pairwise $p$-values for variant comparisons. The diagonal blocks represent comparisons within the same variant type, while the off-diagonal blocks show the inter-type comparisons. The $p$-values are calculated using the absolute-improvement scores; a value of $p<0.05$ indicates a significant difference.
  • Figure 5: Performance comparison across different groups. The five groups correspond to various experimental conditions in the ZuCo dataset, with each dot representing the average metric for each subject.
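
The noise-input control in the Figure 4 caption reports each score as an absolute improvement over the averaged noise-input score, i.e. $\textit{Score}-\textit{Score}_{\mathcal{N}_{in}}$. The snippet below is a small sketch of that subtraction; the function name, the variant names, and every number are hypothetical placeholders rather than results from the paper, and it assumes the noise scores are averaged with a plain arithmetic mean.

```python
def absolute_improvements(scores_by_variant, noise_scores):
    """Subtract the mean noise-input score from each variant's score.

    scores_by_variant: mapping from variant name to a metric value
                       (e.g. BLEU-1) obtained with real EEG input.
    noise_scores:      the same metric obtained with noise input,
                       averaged here with a plain mean (an assumption).
    """
    noise_mean = sum(noise_scores) / len(noise_scores)
    return {name: score - noise_mean for name, score in scores_by_variant.items()}

# Placeholder values chosen only to make the arithmetic easy to follow.
gains = absolute_improvements(
    {"variant_a": 0.5, "variant_b": 0.25},
    noise_scores=[0.25, 0.25],
)
# gains == {"variant_a": 0.25, "variant_b": 0.0}
```

A gain near zero means the metric is reachable without any EEG information, so only the part of a score above the noise baseline can be attributed to the brain signal.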