Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity

Yizhuo Lu, Changde Du, Chong Wang, Xuanliu Zhu, Liuyun Jiang, Xujin Li, Huiguang He

TL;DR

This work tackles reconstructing dynamic natural vision from slow fMRI signals by decoupling semantic, structural, and motion information into separate decoders and then integrating them with an inflated diffusion-based video generator that has never been exposed to video data. The Mind-Animator framework combines tri-modal contrastive learning for semantic decoding, a VQ-VAE–based structure representation, and a Transformer-based Consistency Motion Generator with Sparse Causal Attention to recover coherent, motion-consistent video frames. It demonstrates state-of-the-art performance across three public video-fMRI datasets on semantic, pixel, and spatiotemporal metrics, and includes rigorous interpretability analyses that link decoded features to specific visual cortex regions. The approach reduces reliance on external video priors for motion, offering a more neurobiologically faithful reconstruction pipeline and providing open data and code to accelerate future research.

Abstract

Reconstructing human dynamic vision from brain activity is a challenging task with great scientific significance. Although prior video reconstruction methods have made substantial progress, they still suffer from several limitations, including: (1) difficulty in simultaneously reconciling semantic (e.g., categorical descriptions), structural (e.g., size and color), and consistent motion information (e.g., order of frames); (2) the low temporal resolution of fMRI, which makes it challenging to decode multiple frames of video dynamics from a single fMRI frame; (3) reliance on video generation models, which introduces ambiguity regarding whether the dynamics observed in the reconstructed videos are genuinely derived from fMRI data or are hallucinations of the generative model. To overcome these limitations, we propose a two-stage model named Mind-Animator. During the fMRI-to-feature stage, we decouple semantic, structural, and motion features from fMRI. Specifically, we employ fMRI-vision-language tri-modal contrastive learning to decode semantic features from fMRI and design a sparse causal attention mechanism for decoding multi-frame video motion features through a next-frame-prediction task. In the feature-to-video stage, these features are integrated into videos using an inflated Stable Diffusion, effectively eliminating external video data interference. Extensive experiments on multiple video-fMRI datasets demonstrate that our model achieves state-of-the-art performance. Comprehensive visualization analyses further elucidate the interpretability of our model from a neurobiological perspective. Project page: https://mind-animator-design.github.io/.
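
To make the tri-modal contrastive idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a symmetric InfoNCE objective that pulls each fMRI embedding toward its paired image and text embeddings. The function names (`info_nce`, `tri_modal_loss`), the weighting parameter `alpha`, and the temperature value are illustrative assumptions that do not come from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss for two batches of paired embeddings.

    Row i of `a` is assumed to be the positive match of row i of `b`;
    every other row in the batch serves as a negative.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_probs))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

def tri_modal_loss(fmri, image, text, alpha=0.5, temperature=0.07):
    # Hypothetical weighting: alpha balances the fMRI-image and fMRI-text terms.
    return (alpha * info_nce(fmri, image, temperature)
            + (1.0 - alpha) * info_nce(fmri, text, temperature))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fmri = rng.normal(size=(8, 16))                  # stand-in fMRI embeddings
    image = fmri + 0.01 * rng.normal(size=fmri.shape)  # stand-in image embeddings
    text = fmri + 0.01 * rng.normal(size=fmri.shape)   # stand-in text embeddings
    print("aligned loss:", tri_modal_loss(fmri, image, text))
```

In this sketch, correctly paired embeddings yield a lower loss than mismatched ones, which is the property a contrastive semantic decoder relies on. The actual loss formulation, encoders, and hyperparameters used by Mind-Animator may differ.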

Paper Structure

This paper contains 61 sections, 11 equations, 28 figures, 17 tables, and 2 algorithms.

Figures (28)

  • Figure 1: The human brain's comprehension of dynamic visual scenes. When receiving dynamic visual information, the human brain gradually comprehends low-level structural details such as position, shape, and color in the primary visual cortex, discerns motion information, and ultimately constructs high-level semantic information, such as an overall description of the scene, in the higher visual cortex.
  • Figure 2: Overview of the video reconstruction paradigms.
  • Figure 3: The overall architecture of Mind-Animator, a two-stage video reconstruction model based on fMRI. Three decoders are trained during the fMRI-to-feature stage to disentangle semantic, structural, and motion features from fMRI, respectively. In the feature-to-video stage, the decoded information is input into an inflated Text-to-Image (T2I) model for video reconstruction.
  • Figure 4: The architecture of CMG with the Temporal Module and the fMRI-guided Spatial Module.
  • Figure 5: Reconstruction results on CC2017 dataset. Our reconstructed results are highlighted with a red box, while those of Wen and Nishimoto are delineated by blue and green boxes, respectively.
  • ...and 23 more figures