sEEG-based Encoding for Sentence Retrieval: A Contrastive Learning Approach to Brain-Language Alignment

Yijun Liu

TL;DR

This work tackles decoding linguistic content from invasive sEEG by grounding neural representations in a frozen language model. The authors propose SSENSE, which maps spectrogram-based sEEG into the CLIP sentence embedding space via an InfoNCE contrastive objective, using a frozen text encoder and per-electrode aggregation to produce 512-d embeddings. On a single-subject, naturalistic movie-watching dataset, SSENSE achieves zero-shot sentence retrieval that is statistically above chance: the no-masking configuration yields Recall@1 ≈ 1.2%, Recall@10 ≈ 10.7%, and MRR ≈ 0.0498. These results suggest that foundation-model priors are viable for neural decoding. The work motivates brain-grounded foundation models and points to future scaling across subjects, multimodal inputs, and larger language models, with potential applications in brain-computer interfaces for language understanding and assistive technologies.
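The retrieval metrics quoted above (Recall@k and MRR) can be illustrated with a minimal sketch. It assumes one sEEG embedding per sentence, paired row by row, and ranks candidates by cosine similarity. The function name `retrieval_metrics` and the array layout are hypothetical and not taken from the paper, and the paper's exact evaluation protocol (candidate pool size, tie handling) may differ.

```python
import numpy as np

def retrieval_metrics(brain_emb, text_emb, ks=(1, 10)):
    """Score sentence retrieval from sEEG embeddings.

    brain_emb, text_emb: arrays of shape (n, d); row i of each is a matched pair
    (an assumed layout for illustration).
    """
    # L2-normalise so that the dot product equals cosine similarity.
    b = brain_emb / np.linalg.norm(brain_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = b @ t.T  # (n_queries, n_candidates); true match of row i is column i

    # 1-based rank of the true sentence: one plus the number of
    # candidates that score strictly higher than it.
    true_sim = np.diag(sim)[:, None]
    ranks = 1 + (sim > true_sim).sum(axis=1)

    metrics = {f"recall@{k}": float((ranks <= k).mean()) for k in ks}
    metrics["mrr"] = float((1.0 / ranks).mean())  # mean reciprocal rank
    return metrics
```

Under this definition, Recall@k is the fraction of sEEG queries whose true sentence appears among the top k candidates, and MRR averages the reciprocal of the true sentence's rank, so both depend on the size of the candidate pool.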

Abstract

Interpreting neural activity through meaningful latent representations remains a complex and evolving challenge at the intersection of neuroscience and artificial intelligence. We investigate the potential of multimodal foundation models to align invasive brain recordings with natural language. We present SSENSE, a contrastive learning framework that projects single-subject stereo-electroencephalography (sEEG) signals into the sentence embedding space of a frozen CLIP model, enabling sentence-level retrieval directly from brain activity. SSENSE trains a neural encoder on spectral representations of sEEG using InfoNCE loss, without fine-tuning the text encoder. We evaluate our method on time-aligned sEEG and spoken transcripts from a naturalistic movie-watching dataset. Despite limited data, SSENSE achieves promising results, demonstrating that general-purpose language representations can serve as effective priors for neural decoding.
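To make the training objective concrete, the following is a minimal NumPy sketch of an InfoNCE loss over a batch of paired sEEG and sentence embeddings. Several details are assumptions rather than the paper's stated settings: the symmetric (CLIP-style) averaging over both retrieval directions, the temperature of 0.07, and the function name `info_nce`. In actual training the text embeddings would come from the frozen CLIP text encoder and only the sEEG encoder would receive gradients; this sketch computes the loss value only.

```python
import numpy as np

def info_nce(brain_emb, text_emb, temperature=0.07):
    """InfoNCE loss for a batch of matched (sEEG, sentence) embedding pairs.

    brain_emb, text_emb: arrays of shape (batch, d); row i of each is a positive
    pair, and every other row in the batch acts as a negative.
    The temperature value is an assumed default, not taken from the paper.
    """
    # Cosine-similarity logits, scaled by the temperature.
    b = brain_emb / np.linalg.norm(brain_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (b @ t.T) / temperature

    def diagonal_cross_entropy(l):
        # Numerically stable log-softmax over each row; the target is the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Assumed symmetric form: average the sEEG-to-text and text-to-sEEG terms.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

The loss is small when each sEEG embedding is closest to its own sentence and grows when mismatched sentences score higher, which is what pulls the sEEG encoder's outputs toward the fixed sentence embedding space.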

Paper Structure

This paper contains 12 sections, 2 equations, 1 figure, 1 table, and 2 algorithms.

Figures (1)

  • Figure 1: Proposed SSENSE multimodal framework for aligning sEEG signals with natural language. sEEG segments, zero-padded and transformed via the superlet method into time-frequency representations, are encoded using a dedicated sEEG encoder. Sentence embeddings are obtained from the frozen text encoder of CLIP. The model is trained using a contrastive InfoNCE loss to align sEEG and text representations in a shared embedding space.