A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
Jingyu Lu, Haonan Wang, Qixiang Zhang, Xiaomeng Li
TL;DR
The paper tackles cross-subject, subject-agnostic reconstruction of dynamic visual experiences from fMRI. It introduces Visual Cortex Flow (VCFlow), a three-component architecture that mirrors the brain's ventral and dorsal streams to extract multi-level semantic features, together with a Redistribution Adapter that normalizes features across subjects. By aligning features with OpenCLIP embeddings and applying BiMixCo and inter-subject contrastive objectives, VCFlow achieves high-fidelity video reconstructions without subject-specific retraining, with only a modest performance drop compared to subject-specific models. The approach yields interpretable neural alignments with V1-V4, FFA/PPA, and MST, underscoring its neurobiological plausibility and clinical relevance.
Abstract
Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from the early visual cortex, ventral stream, and dorsal stream, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to strengthen the extraction of subject-invariant semantic representations, thereby improving applicability to previously unseen subjects. Unlike conventional pipelines that require more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution. The source code will be released upon acceptance of the paper.
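To make the idea of feature-level contrastive alignment concrete, the following is a minimal sketch of a generic symmetric InfoNCE loss between fMRI-derived features and target embeddings (for example, OpenCLIP embeddings of the viewed clips). It is an illustration under stated assumptions, not the paper's actual objective: the function names, the temperature value of 0.07, and the use of plain InfoNCE in place of BiMixCo or the inter-subject contrastive term are all assumptions introduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(brain_feats, target_embeds, temperature=0.07):
    """Hypothetical helper: symmetric InfoNCE over a batch of paired rows.

    brain_feats:   (batch, dim) features decoded from fMRI (assumed shape).
    target_embeds: (batch, dim) embeddings of the matching stimuli.
    Row i of each array is treated as a positive pair; all other rows
    in the batch serve as negatives.
    """
    b = l2_normalize(brain_feats)
    t = l2_normalize(target_embeds)
    logits = (b @ t.T) / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_b2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # brain -> target
    loss_t2b = -log_softmax(logits, axis=0)[idx, idx].mean()  # target -> brain
    return 0.5 * (loss_b2t + loss_t2b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = rng.normal(size=(8, 16))                    # stand-in embeddings
    brain = targets + 0.01 * rng.normal(size=(8, 16))     # nearly aligned features
    print("aligned loss:   ", symmetric_info_nce(brain, targets))
    print("mismatched loss:", symmetric_info_nce(brain[::-1], targets))
```

In this sketch, features that match their paired embeddings yield a lower loss than features paired with the wrong stimuli. A subject-invariant variant could, under the same assumptions, treat features from different subjects viewing the same stimulus as positive pairs.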
