DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding
Chong Li, Jingyang Huo, Weikang Gong, Yanwei Fu, Xiangyang Xue, Jianfeng Feng
TL;DR
DecoFuse tackles the challenge of decoding videos from brain activity by decomposing visual content into semantic, spatial, and motion components that map onto distinct brain pathways. The framework uses a brain-inspired, modular pipeline with a separate encoder for each component, Stable Diffusion conditioning for semantic and spatial decoding, a motion decoder for optical flow, and a motion-conditioned diffusion model for video generation. Empirical results show substantial gains across semantic, spatial, motion, and video-generation metrics, and neural encoding analyses corroborate alignment with the dorsal and ventral streams, supporting the two-stream hypothesis. The approach offers a biologically plausible and better-performing route for fMRI-to-video decoding, with ablations confirming the necessity of each component and a public codebase to foster further research.
Abstract
Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, all three aspects are essential and are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components (semantic, spatial, and motion), then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding by decomposing it into manageable sub-tasks, but also establishes a clearer connection between the learned representations and their biological counterparts, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods, achieving 82.4% accuracy for semantic classification, 70.6% accuracy in spatial consistency, a cosine similarity of 0.212 for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses for semantic and spatial information align with the two-stream hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Project page: https://chongjg.github.io/DecoFuse/.
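To make the motion metric quoted above more concrete, the following is a minimal sketch of a cosine similarity between a predicted and a ground-truth optical flow field. It is an illustration only: the abstract does not specify how the flow is represented or aggregated, so the flattening of a whole H x W field of (dx, dy) vectors into one vector, the zero-flow convention, and the helper names `flatten_flow` and `cosine_similarity` are all assumptions rather than the authors' evaluation code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: an all-zero flow field scores 0.
        return 0.0
    return dot / (norm_a * norm_b)


def flatten_flow(flow):
    """Flatten an H x W grid of (dx, dy) vectors into one flat list.

    Hypothetical representation; the paper's flow format may differ.
    """
    return [component for row in flow for vec in row for component in vec]


# Toy 2 x 2 flow fields (made-up values, not data from the paper).
predicted = [[(1.0, 0.0), (0.0, 1.0)], [(1.0, 1.0), (0.0, 0.0)]]
target = [[(1.0, 0.0), (0.0, 1.0)], [(1.0, 1.0), (0.0, 0.0)]]

score = cosine_similarity(flatten_flow(predicted), flatten_flow(target))
```

Under this definition the score lies in [-1, 1]: identical fields score 1, orthogonal motion scores 0, and exactly opposite motion scores -1.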
