A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding

Jingyu Lu, Haonan Wang, Qixiang Zhang, Xiaomeng Li

TL;DR

The paper tackles cross-subject, subject-agnostic reconstruction of dynamic visual experiences from fMRI. It introduces Visual Cortex Flow (VCFlow), a three-component architecture that mirrors the brain's ventral and dorsal streams to extract multi-level semantic features and uses a Redistribution Adapter to normalize representations across subjects. By tying features to OpenCLIP embeddings and applying BiMixCo and inter-subject contrastive objectives, VCFlow achieves high-fidelity video reconstructions without subject-specific retraining, with only a modest performance drop compared to subject-specific models. The approach yields interpretable neural alignments with V1-V4, FFA/PPA, and MST, underscoring its neurobiological plausibility and clinical relevance.

Abstract

Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby improving applicability to previously unseen subjects. Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution. The source code will be released upon acceptance of the paper.
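
To make the idea of feature-level contrastive alignment concrete, the following is a minimal sketch of a generic symmetric InfoNCE loss between fMRI-derived features and target embeddings (for example, OpenCLIP embeddings of the viewed frames). It is not the paper's loss: BiMixCo and the inter-subject terms are not reproduced here, and the function names, the `temperature` value, and the plain-list data layout are assumptions chosen for illustration. The sketch uses only the Python standard library so that it runs without a deep learning framework.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(brain_feats, target_embs, temperature=0.1):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's objective).

    brain_feats[i] and target_embs[i] are assumed to describe the same
    stimulus; every other pairing in the batch is treated as a negative.
    """
    if not brain_feats or len(brain_feats) != len(target_embs):
        raise ValueError("need two non-empty batches of equal size")

    b = [l2_normalize(v) for v in brain_feats]
    t = [l2_normalize(v) for v in target_embs]
    n = len(b)

    # Cosine similarity between every brain feature and every target embedding.
    logits = [
        [sum(x * y for x, y in zip(bi, tj)) / temperature for tj in t]
        for bi in b
    ]

    # Brain -> target direction: each row should peak on its diagonal entry.
    loss_bt = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Target -> brain direction: same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    loss_tb = -sum(log_softmax(columns[i])[i] for i in range(n)) / n

    return 0.5 * (loss_bt + loss_tb)


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_info_nce(feats, feats))        # matched pairs: small loss
    print(symmetric_info_nce(feats, feats[::-1]))  # mismatched pairs: large loss
```

In a subject-agnostic setting, one would presumably draw each batch from several subjects so that the learned features cannot rely on subject-specific signal; how VCFlow actually constructs its batches and positives is not specified in the text above.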

Paper Structure

This paper contains 36 sections, 13 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Prior methods are typically subject-dependent: for each new patient, approximately 12 hours of training are required to build a subject-specific model. Such requirements severely constrain the practical applicability and clinical utility of these approaches. By contrast, our method operates at the subject-agnostic level, allowing inference on a new patient without any additional training and requiring only about 10 seconds of testing, which provides substantial advantages for downstream tasks.
  • Figure 2: The visual cortex can be broadly divided into three types of areas: early visual, ventral, and dorsal. Early visual areas are primarily responsible for detecting low-level features including edges, orientation, and color. Ventral areas are associated with the processing of higher-level and abstract visual information. In contrast, dorsal areas are specialized for encoding dynamic features and spatial representations.
  • Figure 3: The overall framework of VCFlow consists of three core components: (1) Hierarchical Cognitive Alignment Module (HCAM), (2) Subject-Agnostic Redistribution Adapter (SARA), and (3) Hierarchical Explicit Decoder (HED). VCFlow learns three types of semantic representations through HCAM, which are then fused with subject-agnostic common features extracted by SARA. These enriched representations are subsequently decoded by HED to explicitly reconstruct information across multiple semantic levels.
  • Figure 4: The inference stage of VCFlow integrates multi-level semantic embeddings to facilitate comprehensive decoding.
  • Figure 5: Qualitative comparison with GLFA [li2024glfa] shows that VCFlow achieves superior semantic fidelity and temporal coherence, effectively capturing fine-grained semantics and preserving motion information in a subject-agnostic setting.
  • ...and 3 more figures