A Systematic Evaluation of Sample-Level Tokenization Strategies for MEG Foundation Models

SungJun Cho, Chetan Gohil, Rukuang Huang, Oiwi Parker Jones, Mark W. Woolrich

TL;DR

This work systematically evaluates sample-level tokenization strategies for MEG foundation models, introducing a learnable autoencoder-based tokenizer and comparing it with fixed (non-learnable) baselines across three public MEG datasets. Using MEG-GPT, a GPT-style foundation model, the study compares tokenizers on reconstruction fidelity, token prediction, synthetic data quality, subject fingerprinting, and downstream decoding, and finds that simple fixed discretization generally matches learnable approaches. Learnable tokenization yields modest advantages in preserving subject-specific information and in end-to-end fine-tuning for decoding, but overall performance is broadly similar across tokenizers. The results suggest that straightforward, non-learnable tokenization is a practical, scalable choice for neural time-series foundation models, with learnable approaches preferable when preserving subject individuality or high-fidelity generative structure is a priority.

Abstract

Recent success in natural language processing has motivated growing interest in large-scale foundation models for neuroimaging data. Such models often require discretization of continuous neural time series data, a process referred to as 'tokenization'. However, the impact of different tokenization strategies for neural data is currently poorly understood. In this work, we present a systematic evaluation of sample-level tokenization strategies for transformer-based large neuroimaging models (LNMs) applied to magnetoencephalography (MEG) data. We compare learnable and non-learnable tokenizers by examining their signal reconstruction fidelity and their impact on subsequent foundation modeling performance (token prediction, biological plausibility of generated data, preservation of subject-specific information, and performance on downstream tasks). For the learnable tokenizer, we introduce a novel approach based on an autoencoder. Experiments were conducted on three publicly available MEG datasets spanning different acquisition sites, scanners, and experimental paradigms. Our results show that both learnable and non-learnable discretization schemes achieve high reconstruction accuracy and broadly comparable performance across most evaluation criteria, suggesting that simple fixed sample-level tokenization strategies can be used in the development of neural foundation models. The code is available at https://github.com/OHBA-analysis/Cho2026_Tokenizer.
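
To make the idea of sample-level tokenization concrete, here is a minimal sketch of a fixed (non-learnable) tokenizer that maps each time point of a single channel to one of `n_tokens` equal-width amplitude bins and reconstructs the signal from bin centres. This is an illustrative assumption, not the specific baseline used in the paper: the function names, the uniform binning scheme, the 250 Hz sampling rate, and the synthetic sine-plus-noise signal are all hypothetical.

```python
import numpy as np


def fit_uniform_bins(x, n_tokens):
    """Return bin edges that evenly span the range of the training signal."""
    return np.linspace(x.min(), x.max(), n_tokens + 1)


def tokenize(x, edges):
    """Map each sample to the index of the amplitude bin it falls into."""
    # Only interior edges are passed, so indices lie in [0, n_tokens - 1];
    # out-of-range samples are assigned to the first or last bin.
    return np.digitize(x, edges[1:-1])


def detokenize(tokens, edges):
    """Reconstruct each sample as the centre of its bin."""
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres[tokens]


# Hypothetical single-channel signal: 10 Hz sine plus noise (not real MEG data).
rng = np.random.default_rng(0)
t = np.arange(1000) / 250.0  # assumed 250 Hz sampling rate
signal = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(t.size)

n_tokens = 64  # vocabulary size, chosen arbitrarily for illustration
edges = fit_uniform_bins(signal, n_tokens)
tokens = tokenize(signal, edges)      # one integer token per time point
recon = detokenize(tokens, edges)     # continuous signal recovered from tokens
max_abs_error = np.max(np.abs(signal - recon))
```

With equal-width bins, the reconstruction error of any in-range sample is bounded by half a bin width, so increasing the vocabulary size trades a larger token set for higher fidelity. A learnable tokenizer would instead fit the mapping between samples and tokens to the data.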

Paper Structure

This paper contains 38 sections, 9 equations, 10 figures, and 1 table.

Figures (10)

  • Figure 1: Overview of the foundation modeling framework and tokenizer architecture. (a) Schematic of the full generative training pipeline for the MEG-GPT foundation model. (b) Architecture of the learnable MEG tokenizer.
  • Figure 2: MEG-GPT foundation model architecture. (a) High-level overview of the model architecture. (b) Detailed structure of the transformer decoder component.
  • Figure 3: Convergence behavior of MEG-GPT models across tokenizer types on training and validation datasets. $n$ denotes the vocabulary size (i.e., number of unique tokens) for each tokenizer.
  • Figure 4: Distributions of input token counts for different tokenization methods. Token indices are sorted in descending order of frequency, and the value $n$ denotes the vocabulary size for each tokenizer.
  • Figure 5: Reconstruction accuracy and zero-shot generalization performance of different tokenizers. (a) Subject-level PVE between original MEG recordings and data reconstructed from tokens for the Cam-CAN training and test datasets. (b) Same analysis as in (a) evaluated on previously unseen datasets acquired under a task paradigm or using a different MEG scanner. Here, $N$ denotes the number of subjects, and $n$ the number of unique tokens used by each tokenizer.
  • ...and 5 more figures
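
The quantities named in the captions of Figures 4 and 5 can be sketched as follows. This is a hedged illustration only: it assumes that PVE denotes percentage of variance explained, computed as 100 × (1 − var(residual) / var(original)), which this page does not confirm, and the function names and toy arrays are hypothetical.

```python
import numpy as np


def percent_variance_explained(original, reconstructed):
    """Assumed PVE definition: 100 * (1 - var(residual) / var(original))."""
    residual = original - reconstructed
    return 100.0 * (1.0 - np.var(residual) / np.var(original))


def sorted_token_counts(tokens, vocab_size):
    """Count occurrences of each token index, sorted in descending order."""
    counts = np.bincount(tokens, minlength=vocab_size)
    return np.sort(counts)[::-1]


# Toy data for illustration (not taken from the paper).
original = np.array([0.0, 1.0, 2.0, 3.0])
reconstructed = np.array([0.1, 0.9, 2.1, 2.9])
pve = percent_variance_explained(original, reconstructed)

tokens = np.array([2, 0, 2, 1, 2, 0])
counts = sorted_token_counts(tokens, vocab_size=4)
```

Under this definition a perfect reconstruction gives a PVE of 100, and tokens that never occur appear as zeros at the tail of the sorted counts.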