NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction

Zixuan Gong, Guangyin Bao, Qi Zhang, Zhongwei Wan, Duoqian Miao, Shoujin Wang, Lei Zhu, Changwei Wang, Rongtao Xu, Liang Hu, Ke Liu, Yu Zhang

TL;DR

NeuroClips tackles the problem of reconstructing continuous video from non-invasive fMRI by decoupling low-level perceptual flows and high-level semantics into two trainable pathways that guide a pre-trained text-to-video diffusion model. The Perception Reconstructor and Semantics Reconstructor produce a blurry, motion-rich stream and semantically faithful keyframes, respectively, which are fused through an inference pipeline with $\alpha$, $\beta$, and $\gamma$ guidance to yield high-fidelity video. A Multi-fMRI Fusion strategy extends capabilities to longer sequences (up to $6$ s at $8$ FPS) by leveraging CLIP-based semantic similarity and a lightweight MLP to blend frames without additional training. Empirically, NeuroClips achieves substantial gains over state-of-the-art methods in pixel- and spatiotemporal metrics and offers interpretable voxel-weight patterns across the visual cortex, highlighting its potential for brain-computer interfaces while acknowledging data limits and cross-scene generalization challenges.

Abstract

Reconstruction of static visual stimuli from non-invasive brain activity (fMRI) has achieved great success, owing to advanced deep learning models such as CLIP and Stable Diffusion. However, research on fMRI-to-video reconstruction remains limited, since decoding the spatiotemporal perception of continuous visual experiences is formidably challenging. We contend that the key to addressing these challenges lies in accurately decoding both high-level semantics and low-level perception flows, as perceived by the brain in response to video stimuli. To this end, we propose NeuroClips, an innovative framework to decode high-fidelity and smooth video from fMRI. NeuroClips utilizes a semantics reconstructor to reconstruct video keyframes, guiding semantic accuracy and consistency, and employs a perception reconstructor to capture low-level perceptual details, ensuring video smoothness. During inference, it adopts a pre-trained T2V diffusion model injected with both keyframes and low-level perception flows for video reconstruction. Evaluated on a publicly available fMRI-video dataset, NeuroClips achieves smooth high-fidelity video reconstruction of up to 6 s at 8 FPS, gaining significant improvements over state-of-the-art models in various metrics, e.g., a 128% improvement in SSIM and an 81% improvement in spatiotemporal metrics. Our project is available at https://github.com/gongzix/NeuroClips.
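
To make the headline numbers concrete, the short sketch below works through the arithmetic behind them. It assumes the reported percentages are relative gains over a baseline score; the helper names and the baseline value of 0.1 are illustrative assumptions, not figures taken from the paper.

```python
def frame_count(duration_s: float, fps: float) -> int:
    """Number of frames in a clip of the given duration and frame rate."""
    return round(duration_s * fps)


def relative_improvement(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# 6 s at 8 FPS, as stated in the abstract.
frames = frame_count(6, 8)

# Hypothetical baseline score, chosen only for illustration.
baseline_score = 0.1
# Under the relative-gain reading, a 128% improvement means 2.28x the baseline.
improved_score = baseline_score * 2.28
gain = relative_improvement(baseline_score, improved_score)
```

Under this reading, a 6 s clip at 8 FPS is 48 frames, and a 128% gain means the new score is a little more than double the baseline rather than 1.28 times it.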

Paper Structure

This paper contains 31 sections, 10 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: The overall framework of NeuroClips. The red lines represent the inference process.
  • Figure 2: Visualization of Multi-fMRI fusion. With the semantic relevance measure, we can generate video clips up to 6s long without any additional training.
  • Figure 3: Video reconstruction on the cc2017 dataset. On the left are the results of the comparison with previous studies, and on the right are additional comparisons with previous SOTA methods. Best viewed with zoom-in. As shown in the leftmost figure group, Mind-Video's reconstruction fails to keep the details of the character's face consistent, whereas NeuroClips achieves very high consistency.
  • Figure 4: Visualization of ablation study.
  • Figure 5: Visualization of voxel weights for the first ridge regression layer for subject 1. Each voxel's weight is averaged and normalized to the range 0 to 1, and the colorbar is set to 0.25-0.75 for a clearer comparison.
  • ...and 8 more figures
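
The weight processing described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the weights are laid out as one row per voxel, that "averaged" means a plain mean over the layer's output units, and that "normalized" means min-max scaling. All function names and the toy values are hypothetical.

```python
from typing import List, Sequence


def average_voxel_weights(weights: Sequence[Sequence[float]]) -> List[float]:
    """Average each voxel's weights over the layer's output units.

    `weights[i]` holds the weights attached to voxel i (assumed layout).
    """
    return [sum(row) / len(row) for row in weights]


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the midpoint.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def clip_for_display(values: Sequence[float],
                     lo: float = 0.25, hi: float = 0.75) -> List[float]:
    """Saturate values outside the colorbar range used for plotting."""
    return [min(max(v, lo), hi) for v in values]


# Toy weights for three voxels and two output units (illustrative only).
toy_weights = [
    [0.0, 0.0],  # voxel 0
    [1.0, 3.0],  # voxel 1
    [4.0, 4.0],  # voxel 2
]
averaged = average_voxel_weights(toy_weights)  # [0.0, 2.0, 4.0]
normalized = min_max_normalize(averaged)       # [0.0, 0.5, 1.0]
display = clip_for_display(normalized)         # [0.25, 0.5, 0.75]
```

Restricting the colorbar to 0.25-0.75 stretches the color scale over the middle of the range, which makes moderate differences between voxels easier to see at the cost of saturating the extremes.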