Hi-DREAM: Brain Inspired Hierarchical Diffusion for fMRI Reconstruction via ROI Encoder and visuAl Mapping

Guowei Zhang, Yun Zhao, Moein Khajehnejad, Adeel Razi, Levin Kuhlmann

TL;DR

The paper addresses the challenge of reconstructing natural images from fMRI by incorporating cortical hierarchy into diffusion-based decoders. It introduces Hi-DREAM, which uses an ROI adapter to form early/mid/late streams and a multi-scale cortical pyramid aligned with U-Net depths, with a depth-matched ROI-ControlNet for selective conditioning. On the NSD dataset, Hi-DREAM achieves state-of-the-art performance on high-level semantic metrics while maintaining competitive low-level fidelity, and ablation studies reveal distinct roles for early, middle, and late ROIs. This work offers a neuroanatomically grounded, interpretable alternative to global embeddings and highlights how structured conditioning can advance brain-to-image generation and provide neuroscientific insights.

Abstract

Mapping human brain activity to natural images offers a new window into vision and cognition, yet current diffusion-based decoders face a core difficulty: most condition directly on fMRI features without analyzing how visual information is organized across the cortex. This overlooks the brain's hierarchical processing and blurs the roles of early, middle, and late visual areas. We propose Hi-DREAM, a brain-inspired conditional diffusion framework that makes the cortical organization explicit. A region-of-interest (ROI) adapter groups fMRI into early/mid/late streams and converts them into a multi-scale cortical pyramid aligned with the U-Net depth (shallow scales preserve layout and edges; deeper scales emphasize objects and semantics). A lightweight, depth-matched ControlNet injects these scale-specific hints during denoising. The result is an efficient and interpretable decoder in which each signal plays a brain-like role, allowing the model not only to reconstruct images but also to illuminate functional contributions of different visual areas. Experiments on the Natural Scenes Dataset (NSD) show that Hi-DREAM attains state-of-the-art performance on high-level semantic metrics while maintaining competitive low-level fidelity. These findings suggest that structuring conditioning by cortical hierarchy is a powerful alternative to purely data-driven embeddings and provides a useful lens for studying the visual cortex.
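The grouping step described above can be illustrated with a small sketch. It is a toy stand-in under stated assumptions, not the authors' implementation: the ROI-to-stream table follows the examples given in the Figure 1 caption, while the stream-to-depth pairing, the average pooling (in place of a learned adapter), and all function names are hypothetical.

```python
from statistics import fmean

# Assumed ROI-to-stream assignment, taken from the examples in the Figure 1
# caption (V1/V2 early, V3/V4 mid, LOC/FFA late).
STREAMS = {
    "early": ("V1", "V2"),
    "mid": ("V3", "V4"),
    "late": ("LOC", "FFA"),
}

# Assumed pairing of streams with U-Net depths (0 = shallowest, highest resolution).
STREAM_TO_DEPTH = {"early": 0, "mid": 1, "late": 2}


def group_streams(roi_voxels):
    """Concatenate per-ROI voxel responses into early/mid/late streams."""
    streams = {}
    for name, rois in STREAMS.items():
        merged = []
        for roi in rois:
            merged.extend(roi_voxels.get(roi, []))
        streams[name] = merged
    return streams


def pool_to_grid(values, side):
    """Average-pool a 1-D stream into a side x side map.

    Hypothetical placeholder for a learned adapter: it only shows how one
    stream can be turned into a spatial map of a given resolution.
    """
    cells = side * side
    if len(values) < cells:
        raise ValueError("stream has fewer voxels than grid cells")
    flat = []
    for i in range(cells):
        start = i * len(values) // cells
        stop = (i + 1) * len(values) // cells
        flat.append(fmean(values[start:stop]))
    return [flat[r * side:(r + 1) * side] for r in range(side)]


def build_pyramid(roi_voxels, base_side=4):
    """Map each stream to one scale; the resolution halves with depth."""
    streams = group_streams(roi_voxels)
    pyramid = {}
    for name, depth in STREAM_TO_DEPTH.items():
        side = base_side // (2 ** depth)
        pyramid[depth] = pool_to_grid(streams[name], side)
    return pyramid


# Synthetic voxel responses, for illustration only.
roi_voxels = {
    "V1": [float(i) for i in range(16)],
    "V2": [float(i) for i in range(16)],
    "V3": [1.0] * 8,
    "V4": [3.0] * 8,
    "LOC": [2.0] * 4,
    "FFA": [4.0] * 4,
}
pyramid = build_pyramid(roi_voxels)
for depth, grid in pyramid.items():
    print(depth, len(grid), "x", len(grid[0]))
```

In this sketch the early stream lands on the largest map, so it can carry layout and edge information, and the late stream collapses to a single cell that acts as a coarse semantic summary. The real adapter presumably learns these maps rather than averaging voxels.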

Paper Structure

This paper contains 18 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Brain-inspired decoding pipeline. Hi-DREAM mirrors the visual hierarchy: fMRI signals are grouped into early/mid/late ROIs (e.g., V1/V2 for edges, V3/V4 for color and parts, LOC/FFA for semantics) and transformed by an ROI adapter into a multi-scale cortical pyramid. Right: visual stimuli and few-sample reconstructions from Hi-DREAM, illustrating faithful structure and semantics while retaining interpretability and efficiency via compact ROI maps.
  • Figure 2: Overview of Hi-DREAM. fMRI signals are summarized into hierarchical ROI streams (early/mid/late), each with group-specific processing. A Multi-Head Latent Attention (MHLA) module performs gated cross-attention (Vaswani et al., 2017; Li et al., 2021) between ROI-derived latents and U-Net features at multiple depths to capture cooperative interactions among areas. In parallel, a lightweight ROI-conditioned ControlNet consumes compact condition maps to inject neuroanatomical spatial priors, providing structure-aware guidance without full-volume fMRI conditioning.
  • Figure 3: Qualitative comparison on the NSD test set. Each group shows the stimulus (left) and the corresponding reconstructions by Hi-DREAM and prior methods.
  • Figure 4: Ablation on hierarchy modules (accuracy). Bars show, from left to right, w/o ROI Adapter, ROI Adapter (w/o MHLA), and ROI Adapter (w/ MHLA). The adapter markedly improves ANet(2/5) and Inception/CLIP, and adding MHLA further boosts high-level alignment.
  • Figure 5: Face-only subset. Hi-DREAM better preserves identity- and semantics-related cues (eyes, mouth, hairline), consistent with Late-ROI (FFA) guidance, while Early-ROI guidance stabilizes contours and local geometry.
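
The gated cross-attention mentioned in the Figure 2 caption can be sketched in a few lines. This is a minimal single-head illustration and not the paper's MHLA module: it omits the learned query/key/value projections and the multiple heads, and the tanh-gated residual is an assumed gate form chosen for simplicity. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(unet_tokens, roi_tokens, gate):
    """Single-head cross-attention with a tanh-gated residual.

    Queries are U-Net feature tokens; keys and values are ROI latents.
    Assumption: the gate is one scalar passed through tanh, so gate = 0
    leaves the U-Net features unchanged.
    """
    dim = len(roi_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    g = math.tanh(gate)
    output = []
    for query in unet_tokens:
        weights = softmax([dot(query, key) * scale for key in roi_tokens])
        attended = [
            sum(w * value[j] for w, value in zip(weights, roi_tokens))
            for j in range(dim)
        ]
        output.append([x + g * a for x, a in zip(query, attended)])
    return output


# Toy inputs: two U-Net tokens attend over three ROI latents (early/mid/late).
unet_tokens = [[1.0, 0.0], [0.0, 1.0]]
roi_tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

closed = gated_cross_attention(unet_tokens, roi_tokens, gate=0.0)
opened = gated_cross_attention(unet_tokens, roi_tokens, gate=1.0)
print(closed)
print(opened)
```

With the gate at zero the ROI signal is ignored entirely, which is one way a conditioning branch can be added to a pretrained denoiser without disturbing it at the start of training. Applying the same operation at several U-Net depths, each with the ROI latents of the matching scale, would correspond to the multi-depth conditioning the caption describes.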