NeuroCLIP: Brain-Inspired Prompt Tuning for EEG-to-Image Multimodal Contrastive Learning

Jiyuan Wang, Li Zhang, Haipeng Lin, Qile Liu, Gan Huang, Ziyu Li, Zhen Liang, Xia Wu

TL;DR

NeuroCLIP addresses the challenge of aligning EEG signals with visual semantics by rethinking CLIP-based cross-modal learning through brain-inspired prompt tuning. It introduces a dual-stream visual embedding with content-adaptive dynamic filtering and a two-level prompting scheme (instance-level and shared-level) to adapt visual representations under neural constraints, coupled with a softened cross-modal loss that accounts for semantic ambiguity in EEG. The approach yields state-of-the-art zero-shot EEG–image retrieval on THINGS-EEG2, with strong intra- and inter-subject generalization, while maintaining efficiency through frozen backbones and lightweight trainable modules. This work highlights the potential of physiology-aware prompt tuning to bridge brain signals and visual semantics, opening pathways for more flexible and scalable brain–computer interface applications.

Abstract

Recent advances in brain-inspired artificial intelligence have sought to align neural signals with visual semantics using multimodal models such as CLIP. However, existing methods often treat CLIP as a static feature extractor, overlooking its adaptability to neural representations and the inherent physiological-symbolic gap in EEG-image alignment. To address these challenges, we present NeuroCLIP, a prompt tuning framework tailored for EEG-to-image contrastive learning. Our approach introduces three core innovations: (1) We design a dual-stream visual embedding pipeline that combines dynamic filtering and token-level fusion to generate instance-level adaptive prompts, which guide the adjustment of patch embedding tokens based on image content, thereby enabling fine-grained modulation of visual representations under neural constraints; (2) We are the first to introduce visual prompt tokens into EEG-image alignment, acting as global, modality-level prompts that work in conjunction with instance-level adjustments. These visual prompt tokens are inserted into the Transformer architecture to facilitate neural-aware adaptation and parameter optimization at a global level; (3) Inspired by neuroscientific principles of human visual encoding, we propose a refined contrastive loss that better models the semantic ambiguity and cross-modal noise present in EEG signals. On the THINGS-EEG2 dataset, NeuroCLIP achieves a Top-1 accuracy of 63.2% in zero-shot image retrieval, surpassing the previous best method by +12.3%, and demonstrates strong generalization under inter-subject conditions (+4.6% Top-1), highlighting the potential of physiology-aware prompt tuning for bridging brain signals and visual semantics.
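The abstract does not spell out the form of the refined contrastive loss, so the following is only an illustrative sketch of one common way to soften a CLIP-style objective: a symmetric InfoNCE loss whose one-hot targets are replaced by label-smoothed targets. The function names, the `smoothing` parameter, and the temperature value are assumptions made for this example and are not taken from the paper.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def soft_contrastive_loss(sim, temperature=0.07, smoothing=0.1):
    """Symmetric InfoNCE with label-smoothed targets (illustrative only).

    sim[i][j] is the similarity between EEG embedding i and image
    embedding j; matched pairs sit on the diagonal. With smoothing=0
    this reduces to the standard hard-target contrastive loss.
    NOTE: this is a generic stand-in, not the loss defined in the paper.
    """
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two pairs in the batch")
    on_target = 1.0 - smoothing          # weight on the matched pair
    off_target = smoothing / (n - 1)     # weight spread over the other pairs

    def directional(matrix):
        # Cross-entropy between the soft targets and the row-wise softmax.
        total = 0.0
        for i, row in enumerate(matrix):
            logp = log_softmax([v / temperature for v in row])
            total -= sum(
                (on_target if i == j else off_target) * lp
                for j, lp in enumerate(logp)
            )
        return total / n

    transposed = [list(col) for col in zip(*sim)]
    # Average the EEG-to-image and image-to-EEG directions.
    return 0.5 * (directional(sim) + directional(transposed))
```

The intuition behind softening is that a hard one-hot target penalizes any probability mass placed on semantically similar images, whereas a softened target tolerates some of that mass, which is one plausible way to account for ambiguity in noisy EEG responses.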

Paper Structure

This paper contains 32 sections, 22 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison between the Classical and Our Proposed Visual Prompt Tuning Paradigms
  • Figure 2: The NeuroCLIP framework. EEG signals are perturbed and encoded; images are processed through a Dual-Stream Visual Embedding with a Dynamic Filter Layer (DFL). Instance-specific cues are injected by Cross-Attention Token-level Fusion (CATF), and Two-Level Visual Prompt Learning introduces both instance-level and shared-level prompts into the frozen CLIP-ViT. EEG–Image embeddings are then projected and aligned for cross-modal retrieval.
  • Figure 3: Comparison of average Top-1 and Top-5 accuracy across different methods under (a) intra-subject and (b) inter-subject settings on the THINGS-EEG2 dataset.
  • Figure 4: Ablation study on temporal segment
  • Figure 5: Visualization of the performance of different encoders
  • ...and 4 more figures
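
Figure 3 reports Top-1 and Top-5 retrieval accuracy. As a rough illustration of how such a metric is typically computed, the sketch below ranks candidate image embeddings by cosine similarity to each EEG embedding and counts a hit when the matched image appears among the top k. The function names and the toy vectors are hypothetical, and the paper's actual evaluation protocol may differ in its details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_accuracy(eeg_embs, img_embs, k=1):
    """Fraction of EEG queries whose matched image ranks in the top k.

    Assumes eeg_embs[i] is paired with img_embs[i] and that both live in
    the same embedding space (for example, after projection).
    """
    hits = 0
    for i, query in enumerate(eeg_embs):
        scores = [cosine(query, candidate) for candidate in img_embs]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(img_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(eeg_embs)


# Toy example: three candidate images and three EEG queries.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eeg = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]
top1 = top_k_accuracy(eeg, images, k=1)
top2 = top_k_accuracy(eeg, images, k=2)
```

In this toy case the third query lies closer to the first image than to its own match, so it misses at k=1 but is recovered at k=2, which is why Top-5 figures are always at least as high as Top-1.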