Neural Codecs as Biosignal Tokenizers

Kleanthis Avramidis, Tiantian Feng, Woojae Jeong, Jihwan Lee, Wenhui Cui, Richard M Leahy, Shrikanth Narayanan

TL;DR

BioCodec tackles the challenge of decoding noisy biosignals by learning discrete, low-level representations through residual vector quantization, forming a codec-based foundation model for EEG (and EMG). It forgoes predefined semantic tokens and instead tokenizes continuous waveforms, enabling channel-agnostic representations that generalize across diverse downstream tasks. Pre-trained on thousands of EEG hours, BioCodec delivers competitive or superior performance while using far fewer downstream parameters and achieving up to eightfold input compression. Experiments also show robust performance in low-resource settings and across modalities. Analyses of codebook usage and spatial coherence reveal a meaningful, hierarchical latent structure that preserves essential signal properties while discarding noise, supporting practical deployment in clinical and consumer biosignal contexts.

Abstract

Neurophysiological recordings such as electroencephalography (EEG) offer accessible and minimally invasive means of estimating physiological activity for applications in healthcare, diagnostic screening, and even immersive entertainment. However, these recordings yield high-dimensional, noisy time-series data that typically require extensive pre-processing and handcrafted feature extraction to reveal meaningful information. Recently, there has been a surge of interest in applying representation learning techniques from large pre-trained (foundation) models to effectively decode and interpret biosignals. We discuss the challenges of incorporating such methods and introduce BioCodec, an alternative representation learning framework inspired by neural codecs to capture low-level signal characteristics in the form of discrete tokens. Pre-trained on thousands of EEG hours, BioCodec shows efficacy across multiple downstream tasks, ranging from clinical diagnostic tasks and sleep physiology to decoding speech and motor imagery, particularly in low-resource settings. Additionally, we provide a qualitative analysis of codebook usage and estimate the spatial coherence of codebook embeddings from EEG connectivity. Notably, we also document the suitability of our method for other biosignal data, namely electromyographic (EMG) signals. Overall, the proposed approach provides a versatile solution for biosignal tokenization that performs competitively with state-of-the-art models. The source code and model checkpoints are shared.
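
As a rough illustration of how a neural codec can turn continuous embeddings into discrete tokens, the sketch below implements a generic residual vector quantization (RVQ) step in NumPy. It is not the BioCodec implementation: the function name `rvq_encode`, the four quantizer layers, the 8-dimensional embeddings, and the random (untrained) codebooks are all assumptions made for the example. Only the codebook size of 256 is taken from the figure captions of the paper.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize vectors x of shape (N, D) with a list of codebooks, each (K, D).

    Returns integer tokens of shape (N, n_layers) and the quantized
    reconstruction of shape (N, D), i.e. the sum of the selected codes.
    """
    residual = x.astype(np.float64)
    quantized = np.zeros_like(residual)
    indices = []
    for cb in codebooks:
        # Squared Euclidean distance from every residual to every code.
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)      # nearest code per vector
        chosen = cb[idx]
        quantized += chosen            # accumulate the reconstruction
        residual = residual - chosen   # the next layer quantizes what is left
        indices.append(idx)
    return np.stack(indices, axis=1), quantized

# Toy data: 32 embedding frames of dimension 8 (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
# Four random codebooks with 256 codes each; real codebooks would be learned.
codebooks = [rng.normal(size=(256, 8)) * 0.5 ** i for i in range(4)]
tokens, x_hat = rvq_encode(x, codebooks)
```

Each row of `tokens` holds one code index per quantizer layer, so a frame is represented by a handful of small integers instead of a real-valued vector. In a trained codec the codebooks are optimized jointly with the encoder and decoder, so later layers refine the error left by earlier ones.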

Paper Structure

This paper contains 41 sections, 13 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: The BioCodec framework is pre-trained on single-channel biosignals via a neural codec that comprises a SEANet [li2021real] autoencoder and residual vector quantization (RVQ). Quantized embeddings (with quantization error $\mathcal{L}_{\text{w}}$) are pre-trained for signal reconstruction in the time ($\mathcal{L}_{\text{t}}$) and frequency (multiscale $\mathcal{L}_{\text{f}}$) domains. For downstream inference, they are fed into two single-layer transformers across time (T) and channels (M) with an MLP head.
  • Figure 2: Distribution of code indices in the TUAB dataset (10000 samples) across the RVQ layers. All 256 codes are used in every layer, with highly diverse usage.
  • Figure 3: Spatial coherence analysis showing the aggregate Spearman correlation between EEG and RVQ connectivity matrices (left) and an example pair of connectivity matrices (right).
  • Figure 4: Example reconstructed EEG waveforms from the pre-trained BioCodec decoder.
  • Figure 5: Example stimuli and reconstruction mel-spectrograms selected from the N400 dataset. Top: ground truth audio speech stimuli. Bottom: model-based estimations using BioCodec.
  • ...and 2 more figures
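
The spatial coherence analysis named in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: connectivity is taken here to be the Pearson correlation between channels, each channel is assumed to have one flattened embedding vector, and the helper names (`ordinal_ranks`, `spearman`, `upper_triangle`, `spatial_coherence`) are hypothetical. The caption only states that a Spearman correlation is computed between EEG and RVQ connectivity matrices.

```python
import numpy as np

def ordinal_ranks(v):
    # Ordinal ranks; ties are not averaged, which is adequate for
    # continuous values where exact ties are not expected.
    order = np.argsort(v)
    ranks = np.empty(len(v), dtype=np.float64)
    ranks[order] = np.arange(len(v))
    return ranks

def spearman(a, b):
    # Spearman correlation is the Pearson correlation of the ranks.
    return float(np.corrcoef(ordinal_ranks(a), ordinal_ranks(b))[0, 1])

def upper_triangle(m):
    # Off-diagonal upper-triangle entries of a symmetric matrix.
    i, j = np.triu_indices(m.shape[0], k=1)
    return m[i, j]

def spatial_coherence(eeg, emb):
    """eeg: (M, T) raw channels; emb: (M, F) one embedding vector per channel."""
    conn_eeg = np.corrcoef(eeg)   # (M, M) channel connectivity, assumed Pearson
    conn_emb = np.corrcoef(emb)   # (M, M) embedding connectivity, assumed Pearson
    return spearman(upper_triangle(conn_eeg), upper_triangle(conn_emb))

# Toy data: 8 channels, 500 samples (hypothetical sizes).
rng = np.random.default_rng(0)
eeg = rng.normal(size=(8, 500))
# Using the signal itself as the "embedding" gives identical connectivity.
score_same = spatial_coherence(eeg, eeg)
# Unrelated random embeddings serve as a baseline for comparison.
score_rand = spatial_coherence(eeg, rng.normal(size=(8, 64)))
```

A score near 1 means that channel pairs which are strongly coupled in the raw EEG are also close in the embedding space, whereas unrelated embeddings carry no such guarantee.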