Shifting Attention to You: Personalized Brain-Inspired AI Models

Stephen Chong Zhao, Yang Hu, Jason Lee, Andrew Bender, Trisha Mazumdar, Mark Wallace, David A. Tovar

TL;DR

This work demonstrates that integrating human behavioral embeddings and millisecond neural dynamics into a CLIP framework yields substantially improved alignment with human perception and neural activity. By fine-tuning CLIP with SPoSE dimensions (CLIP-HBA-Behavior) and then aligning dynamic visual representations to MEG data (CLIP-HBA-MEG), the approach achieves superior behavioral and neural RDM alignment and enables individualized models that capture participant-specific neural dynamics. The framework is scalable, interpretable, and adaptable to multisensory data, with broad implications for neuroscience, personalized medicine, and intuitive human–AI interfaces. However, it also highlights the need for diverse data to ensure robustness across degraded conditions and populations, pointing toward multisensory extensions and cognitive digital twins as future directions.

Abstract

The integration of human and artificial intelligence offers a powerful avenue for advancing our understanding of information processing, as each system provides unique computational insights. However, despite the promise of human-AI integration, current AI models are largely trained on massive datasets and optimized for population-level performance, and they lack mechanisms to align their computations with individual users' perceptual semantics and neural dynamics. Here we show that integrating human behavioral insights and millisecond-scale neural data within a fine-tuned CLIP-based model not only captures generalized and individualized aspects of perception but also more than doubles behavioral performance compared to the unmodified CLIP baseline. By embedding human inductive biases and mirroring dynamic neural processes during training, personalized neural fine-tuning improves predictions of human similarity judgments and tracks the temporal evolution of individual neural responses. Our work establishes a novel, interpretable framework for designing adaptive AI systems, with broad implications for neuroscience, personalized medicine, and human-computer interaction.

Paper Structure

This paper contains 31 sections, 11 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Schematic of the CLIP-HBA-Behavior fine-tuning process using behavioral data. The 66 SPoSE text dimensions are fed into the text encoder, producing 66 text representations $D_1 \dots D_{66}$. Concurrently, visual stimuli from the THINGS dataset are input into the vision encoder, generating their corresponding visual representations $V$. The features from the two modalities are bound via a dot product projection, which maps the visual features onto each of the text dimensions to form a 66-dimensional embedding $e$ for each image. Weight-Decomposed Low-Rank Adaptation (DoRA), a parameter-efficient fine-tuning (PEFT) method, is used to fine-tune the attention modules of the text and vision encoders, with the SPoSE behavioral embeddings as the training objective.
  • Figure 2: Behavioral and neural alignment of fine-tuned CLIP-HBA-Behavior. (A) Behavioral results: Representational dissimilarity matrices (RDMs) for 48 objects, predicted by CLIP-HBA-Behavior (top) and CLIP-ViT (bottom), with Spearman rank correlations ($\rho$) of 0.78 and 0.32, respectively. (B) Neural results: Temporal correlations between model-predicted RDMs and MEG RDMs from THINGS (top left) and external datasets under varied conditions (remaining panels). CLIP-HBA-Behavior consistently outperforms CLIP-ViT in correlation and area under the curve (AUC), demonstrating enhanced neural alignment and generalizability.
  • Figure 3: Schematic of the CLIP-HBA-MEG fine-tuning process using neural signals. A Feature Reweighting Matrix, pre-optimized at initialization, dynamically computes weighted combinations of vision encoder layer activations to align with neural decoding RDMs. The temporal scalers $\alpha_T$ and $\beta_T$ modulate, respectively, the magnitude of visual feature aggregation and the binding of visual-semantic features. Dimension-wise Gaussian noise is added to the embedding after binding, with a dynamically controlled noise level, mimicking the varying stability of human neural responses while preventing the model from overfitting to specific noisy time points. Predicted RDMs are compared against MEG target RDMs via a custom loss function, updating the vision encoder’s attention layers and adapting the feature reweighting matrix. The text encoder is frozen with behaviorally pre-trained weights to keep the semantic features of the text representations stable.
  • Figure 4: Behavioral and neural alignment of fine-tuned CLIP-HBA-MEG. (A) Behavioral validation: Comparison of the dynamic embedding space of the CLIP-HBA-MEG model across all timepoints (purple bars) with the THINGS behavioral data of 48 sample objects. The static behavioral alignment of the baseline CLIP-ViT model (green line) and the behaviorally fine-tuned CLIP-HBA-Behavior (red line) from Figure 2 are also plotted. CLIP-HBA-MEG demonstrates sustained higher behavioral alignment than the baseline soon after stimulus onset but peaks below the alignment achieved by CLIP-HBA-Behavior. (B) Neural validations: Temporal correlations between CLIP-HBA-MEG model-predicted RDMs and MEG RDMs are shown for THINGS images (top left panel) and external datasets under varied conditions: clear objects without background, clear monochromatic images, and blurry monochromatic images (remaining panels). CLIP-HBA-MEG consistently outperforms CLIP-ViT in correlation strength and area under the curve (AUC) across all tested conditions.
  • Figure 5: Example of a dynamic saliency map from the neurally fine-tuned CLIP-HBA-MEG model.
  • ...and 6 more figures
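
The captions of Figures 1 and 2 describe two computations that a short example may make more concrete: binding visual features to the 66 text dimensions with a dot product, and scoring alignment between two RDMs with a Spearman rank correlation. The following is a minimal sketch under stated assumptions, not the authors' implementation. The random features stand in for real encoder outputs, the feature width of 512 is arbitrary, the RDM metric (1 minus Pearson correlation) is an assumed choice, and all function names are hypothetical.

```python
import numpy as np


def bind_embeddings(visual, text_dims):
    """Project visual features onto each text dimension via a dot product.

    visual: (n_images, d), text_dims: (n_dims, d) -> (n_images, n_dims).
    """
    return visual @ text_dims.T


def rdm(embeddings):
    """Dissimilarity matrix between rows.

    Assumption: dissimilarity is 1 - Pearson correlation; the source
    does not state which metric is used.
    """
    return 1.0 - np.corrcoef(embeddings)


def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries as a flat vector."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


def ordinal_ranks(values):
    """Ranks 0..n-1; assumes no tied values (true for continuous data)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def rdm_alignment(rdm_a, rdm_b):
    """Spearman rho between the upper triangles of two RDMs."""
    a = ordinal_ranks(upper_triangle(rdm_a))
    b = ordinal_ranks(upper_triangle(rdm_b))
    return float(np.corrcoef(a, b)[0, 1])


# Placeholder data: 48 images, 66 dimensions, hypothetical feature width 512.
rng = np.random.default_rng(0)
visual = rng.normal(size=(48, 512))
text_dims = rng.normal(size=(66, 512))

embedding = bind_embeddings(visual, text_dims)        # shape (48, 66)
model_rdm = rdm(embedding)                            # shape (48, 48)

# A noisy copy of the embedding plays the role of a behavioral target.
target_rdm = rdm(embedding + 0.5 * rng.normal(size=embedding.shape))
score = rdm_alignment(model_rdm, target_rdm)
print(f"Spearman rho between model and target RDMs: {score:.3f}")
```

Only the upper triangle is compared because an RDM is symmetric with a zero diagonal, so the remaining entries carry no additional information. For data with tied values, a tie-aware rank function would be needed in place of `ordinal_ranks`.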