DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding

Chong Li, Jingyang Huo, Weikang Gong, Yanwei Fu, Xiangyang Xue, Jianfeng Feng

TL;DR

DecoFuse decodes videos from brain activity by decomposing visual content into semantic, spatial, and motion components that map onto distinct brain pathways. The framework is a brain-inspired, modular pipeline: separate encoders for each component, Stable Diffusion conditioning for semantic and spatial decoding, a motion decoder that predicts optical flow, and a motion-conditioned diffusion model for video generation. Reported results improve on prior methods across semantic, spatial, motion, and video-generation metrics, and neural encoding analyses show alignment with the ventral and dorsal streams, consistent with the two-streams hypothesis. The approach offers a biologically plausible and better-performing route for fMRI-to-video decoding, with ablations supporting the contribution of each component and a public codebase to foster further research.

Abstract

Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, these aspects are all essential and are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components - semantic, spatial, and motion - then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding by decomposing it into manageable sub-tasks, but also establishes a clearer connection between learned representations and their biological counterpart, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods, achieving 82.4% accuracy for semantic classification, 70.6% accuracy in spatial consistency, a 0.212 cosine similarity for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses for semantic and spatial information align with the two-streams hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Project page: https://chongjg.github.io/DecoFuse/.
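
The abstract reports a cosine similarity for motion prediction and a 50-way accuracy for video generation, but this page does not say how either is computed. The sketch below shows one common reading of each, as an illustration only: cosine similarity over flattened optical-flow vectors, and an N-way top-1 test in which the ground-truth class must outscore N-1 randomly drawn distractor classes. The function names, the input layout, and the number of trials are assumptions, not the authors' evaluation code.

```python
import math
import random


def flow_cosine_similarity(pred, target):
    """Cosine similarity between two optical-flow fields.

    Each field is a list of (dx, dy) vectors; both are flattened
    into one long vector before comparison (an assumed convention).
    """
    p = [c for vec in pred for c in vec]
    t = [c for vec in target for c in vec]
    if len(p) != len(t):
        raise ValueError("flow fields must have the same shape")
    dot = sum(a * b for a, b in zip(p, t))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
    # Treat an all-zero field as having no similarity with anything.
    return dot / norm if norm > 0 else 0.0


def n_way_top1(probs, true_class, n_way=50, trials=100, rng=None):
    """Fraction of trials in which the true class beats n_way - 1 distractors.

    probs is a list of classifier scores, one per class, computed on the
    decoded output; distractor classes are resampled in every trial.
    """
    rng = rng or random.Random(0)
    others = [c for c in range(len(probs)) if c != true_class]
    hits = 0
    for _ in range(trials):
        distractors = rng.sample(others, n_way - 1)
        if all(probs[true_class] > probs[c] for c in distractors):
            hits += 1
    return hits / trials
```

Under this reading, chance level for the 50-way test is 2%, which gives a reference point for the 21.9% figure quoted in the abstract.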

Paper Structure

This paper contains 12 sections, 10 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Diagram of the DecoFuse framework. Inspired by the brain’s two-streams hypothesis (Goodale and Milner, 1992), the DecoFuse pipeline decomposes video into three components: semantic ("what"), spatial ("where"), and motion ("how"). Neural features are extracted by an fMRI encoder and decomposed into semantic, spatial, and motion embeddings. These components are then fused to generate video. Additionally, neural encoding analyzes the differential contribution of semantic and spatial embeddings in predicting signals from the brain's dorsal and ventral streams, confirming alignment with the two-streams hypothesis (Goodale and Milner, 1992).
  • Figure 2: Details of the DecoFuse framework. Neural features are extracted by an fMRI encoder and decomposed into semantic, spatial, and motion embeddings through three independent encoders. These components are then fused to generate video in three stages: (1) fMRI-to-image decoding, which uses Stable Diffusion and ControlNet to generate static images based on high-level semantic and low-level spatial embeddings; (2) fMRI-to-motion decoding, which predicts optical flow using an image- and fMRI-based motion decoder to capture dynamic elements of the video; (3) fMRI-to-video decoding, where the decoded image and optical flow are combined to generate the final video using a motion-conditioned video diffusion model.
  • Figure 3: Results of fMRI-to-image reconstruction. Our model generates images that align well with the ground truth in both semantic and spatial aspects. Comparing results with and without the semantic ("what") and spatial ("where") embeddings shows that both significantly enhance the model’s ability to accurately reconstruct and localize objects within the image.
  • Figure 4: Results of fMRI-to-motion decoding. Our model effectively predicts optical flow based on fMRI and image data, demonstrating accurate motion decoding performance.
  • Figure 5: Results of fMRI-to-video decoding. Our model shows accurate decoding performance at both the semantic and pixel levels.
  • ...and 1 more figure
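
The data flow described in the Figure 2 caption can be sketched as plain function composition. This is a minimal structural sketch, not the authors' implementation: every class, field, and function name below is hypothetical, and the toy lambdas stand in for the real fMRI encoder, Stable Diffusion with ControlNet, the motion decoder, and the motion-conditioned video diffusion model.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class Decomposed:
    semantic: Vector  # "what"
    spatial: Vector   # "where"
    motion: Vector    # "how"


@dataclass
class DecoFuseSketch:
    # Shared fMRI encoder followed by three independent component encoders.
    fmri_encoder: Callable[[Vector], Vector]
    semantic_encoder: Callable[[Vector], Vector]
    spatial_encoder: Callable[[Vector], Vector]
    motion_encoder: Callable[[Vector], Vector]
    # Stage 1: semantic + spatial embeddings -> static image.
    image_decoder: Callable[[Vector, Vector], str]
    # Stage 2: decoded image + motion embedding -> optical flow.
    motion_decoder: Callable[[str, Vector], str]
    # Stage 3: decoded image + optical flow -> video.
    video_decoder: Callable[[str, str], str]

    def decompose(self, fmri: Vector) -> Decomposed:
        feats = self.fmri_encoder(fmri)
        return Decomposed(
            semantic=self.semantic_encoder(feats),
            spatial=self.spatial_encoder(feats),
            motion=self.motion_encoder(feats),
        )

    def decode(self, fmri: Vector) -> str:
        parts = self.decompose(fmri)
        image = self.image_decoder(parts.semantic, parts.spatial)
        flow = self.motion_decoder(image, parts.motion)
        return self.video_decoder(image, flow)


# Toy stand-ins that only record which inputs reach which stage.
toy = DecoFuseSketch(
    fmri_encoder=lambda x: [2 * v for v in x],
    semantic_encoder=lambda f: f[:1],
    spatial_encoder=lambda f: f[1:2],
    motion_encoder=lambda f: f[2:],
    image_decoder=lambda sem, spa: f"image(sem={len(sem)}, spa={len(spa)})",
    motion_decoder=lambda img, mot: f"flow({img}, mot={len(mot)})",
    video_decoder=lambda img, flow: f"video({img}, {flow})",
)
print(toy.decode([1.0, 2.0, 3.0]))
```

Keeping each stage behind its own callable mirrors the modular design in the caption: an ablation that drops the semantic or spatial embedding, as in Figure 3, amounts to swapping one callable for a stub.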