CAMEL-CLIP: Channel-aware Multimodal Electroencephalography-text Alignment for Generalizable Brain Foundation Models

Hanseul Choi, Jinyeong Park, Seongwon Jin, Sungho Park, Jibum Kim

Abstract

Electroencephalography (EEG) foundation models have shown promise for learning generalizable representations, yet they remain sensitive to channel heterogeneity, such as changes in channel composition or ordering. We propose channel-aware multimodal EEG-text alignment contrastive language-image pretraining (CAMEL-CLIP), a contrastive EEG-text multimodal foundation model designed to be robust to heterogeneous channel configurations and widely applicable to diverse downstream tasks. CAMEL-CLIP introduces three key components: (1) channel attribute-based positional encoding, which identifies channels through semantic information; (2) dynamic channel projection, which generates variable-length embeddings by independently projecting each channel without feature compression; and (3) dual-level contrastive learning, which jointly performs channel-level and sample-level contrastive learning to capture both channel-specific and global signal characteristics. Experimental results demonstrate that CAMEL-CLIP achieves state-of-the-art performance under linear-probing and outperforms existing foundation models that rely on full-finetuning.
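Components (2) and (3) can be illustrated with a small sketch. It is not the authors' implementation: the shared projection matrix `weight`, the symmetric InfoNCE form of the loss, and the temperature of 0.07 are all assumptions borrowed from common CLIP-style setups, and every function name is hypothetical. The first function shows how projecting each channel independently yields one embedding per channel, so the output length follows the channel count. The second shows a generic paired contrastive loss that could be applied once over channel-level pairs and once over sample-level pairs; how the paper builds those pairs and weights the two terms is not stated in the abstract.

```python
import math
import random


def project_channels(eeg, weight):
    """Project every channel independently with one shared matrix (assumed).

    eeg:    list of channels, each a list of T samples (any channel count)
    weight: D rows of T coefficients
    Returns one D-dimensional embedding per input channel.
    """
    return [[sum(w * x for w, x in zip(row, channel)) for row in weight]
            for channel in eeg]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def clip_style_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings, where a[i] matches b[i]."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    # Scaled cosine similarity between every a[i] and every b[j].
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b]
              for u in a]

    def cross_entropy_diag(rows):
        # Cross-entropy with the matching pair (the diagonal) as the target.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(z - m) for z in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy data: two recordings with different channel counts, same projection.
random.seed(0)
T, D = 8, 4
weight = [[random.gauss(0, 1) for _ in range(T)] for _ in range(D)]
recording_19ch = [[random.gauss(0, 1) for _ in range(T)] for _ in range(19)]
recording_3ch = [[random.gauss(0, 1) for _ in range(T)] for _ in range(3)]
print(len(project_channels(recording_19ch, weight)))
print(len(project_channels(recording_3ch, weight)))
```

Because the projection never mixes channels, adding, dropping, or reordering channels changes only the number and order of output embeddings, not the weights that have to be learned.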

Paper Structure (27 sections, 4 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: UMAP visualization of embedding vectors for each model. (a) EEG-CLIP embeddings. (b) CAMEL-CLIP channel-wise embeddings. Unlike the baseline model (EEG-CLIP), whose embeddings are not separated by channel, CAMEL-CLIP forms distinct embedding distributions for different channels.
  • Figure 2: Paradigm shift in deep learning-based EEG decoding. (a) Conventional task-specific models, where a separate EEG encoder is trained for each task and dataset. (b) Prior brain foundation models, which enable multi-task transfer via pretrained weights but still require finetuning on downstream datasets due to channel heterogeneity. (c) CAMEL-CLIP, which mitigates channel heterogeneity and supports multiple tasks with linear-probing alone.
  • Figure 3: Framework of the proposed model. (a) Channel attribute-based positional encoding. (b) Dynamic channel projection for channel-wise embeddings. (c) Dual-level contrastive learning at the channel and sample levels. Here, A1, FP2, P4, FP3, and T6 annotate example channel names.
  • Figure 4: Pipeline of the proposed synthetic report ensemble prompt generation. (a) The proposed text prompt generation method for text-based classification generates multiple synthetic reports and averages their embeddings. (b) An example of a generated synthetic report.
  • Figure 5: Cosine similarity distributions across prompting strategies. Cosine similarity between prompts for normal labels in the pathological task and text embeddings from the validation set. The prompts from the proposed ensemble method show a higher mean cosine similarity.
  • ...and 4 more figures
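
The ensemble prompting idea in Figures 4 and 5 (generate several synthetic reports per label, average their embeddings, compare by cosine similarity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the 3-dimensional vectors are made-up stand-ins for text-encoder outputs, the "abnormal" label name is assumed, and choosing the label with the highest cosine similarity is the usual CLIP-style rule rather than something the captions confirm.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(u, v):
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def ensemble_prompt(report_embeddings):
    """Average the unit-normalized embeddings of several reports for one label."""
    unit = [l2_normalize(e) for e in report_embeddings]
    mean = [sum(col) / len(unit) for col in zip(*unit)]
    return l2_normalize(mean)


def classify(eeg_embedding, prompts):
    """Return the label whose ensemble prompt is most similar to the EEG embedding."""
    return max(prompts, key=lambda label: cosine(eeg_embedding, prompts[label]))


# Made-up embeddings: three synthetic reports per label (hypothetical values).
prompts = {
    "normal": ensemble_prompt([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]),
    "abnormal": ensemble_prompt([[0.1, 0.9, 0.0], [0.0, 1.0, 0.2], [0.2, 0.8, 0.1]]),
}
print(classify([0.7, 0.2, 0.1], prompts))
```

Averaging over several reports is intended to smooth out the wording of any single prompt, which is consistent with the higher mean cosine similarity that the Figure 5 caption reports for the ensemble method.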