SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance

Minghan Yang; Lan Yang; Ke Li; Honggang Zhang; Kaiyue Pang; Yizhe Song

TL;DR

SemVideo is a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. On the CC2017 and HCP datasets it achieves superior performance in both semantic alignment and temporal consistency, setting a new state of the art in fMRI-to-video reconstruction.

Abstract

Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions. To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Leveraging this semantic guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that leverages hierarchical semantic guidance for video reconstruction. Experiments conducted on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.
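To make the alignment idea concrete, the following is a minimal sketch of how fMRI features might be scored against the three levels of semantic targets. The abstract states only that the Semantic Alignment Decoder aligns fMRI signals with CLIP-style embeddings derived from SemMiner. The linear heads, the cosine-distance loss, and every name here (`LEVELS`, `linear`, `alignment_loss`) are illustrative assumptions, not the authors' implementation.

```python
import math

# Hypothetical level names, following the three cue types named in the abstract.
LEVELS = ("anchor", "motion", "holistic")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def alignment_loss(fmri, heads, targets):
    """Mean cosine distance between predicted and target embeddings.

    Assumption: one linear head per semantic level and a (1 - cosine)
    objective. The paper may use a different decoder and loss.
    """
    total = 0.0
    for level in LEVELS:
        predicted = linear(heads[level], fmri)
        total += 1.0 - cosine(predicted, targets[level])
    return total / len(LEVELS)
```

Under these assumptions the loss approaches 0 when each predicted embedding points in the same direction as its target and approaches 2 when it points in the opposite direction.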

Paper Structure (29 sections, 14 equations, 12 figures, 6 tables)

Introduction
Related Work
Video Caption Generation
Diffusion-based video generation
fMRI-based Video Reconstruction
Methodology
SemMiner
SemVideo
Experiments
Main Results
Validating the Source of Motion Improvement
Ablation Study
Neuroscience Interpretability
Conclusion
SemMiner
...and 14 more sections

Figures (12)

Figure 1: Top: While a subject watches an original video stimulus, their brain activity is recorded via fMRI. Middle: Reconstructed results from previous methods, which suffer from two issues: Appearance Mismatch and Motion Misalignment. Bottom: Reconstructed result from SemVideo, which achieves both semantic consistency (reconstructing the "kitten") and motion coherence (matching dynamic actions like "crouching" and "turning") by leveraging hierarchical semantic descriptions as intermediate targets to guide the fMRI signal decoding process.
Figure 2: Overview of the SemVideo training pipeline. In the first stage, the Semantic Alignment Decoder is trained to map fMRI signals to three levels of semantic targets, denoted as $Z(C_L)$. In the second stage, the Motion Adaptation Decoder is trained to utilize the predicted motion semantics $\hat{Z}(C_{\text{motion}})$ to refine the latent embedding of each reconstructed frame.
Figure 3: Overview of the SemVideo inference pipeline. fMRI signals are first decoded into $\hat{Z}(C_{\text{L}})$ by the Semantic Alignment Decoder (SAD). $\hat{Z}(C_{\text{motion}})$ conditions the Motion Adaptation Decoder (MAD) to refine the frame embeddings $\hat{E}(x)$, which are passed through a VAE decoder to generate a blurry video. $\hat{E}(x)$ and $\hat{Z}(C_{\text{anchor}})$ guide the SD model to generate an anchor frame, which, together with the blurry video and $\hat{Z}(C_{\text{holi}})$, is fed into a T2V model to yield the final reconstruction.
Figure 4: Qualitative comparison of reconstruction results on the CC2017 dataset between previous methods and our proposed SemVideo. Reconstructions generated by SemVideo are highlighted with red boxes.
Figure 5: Two sets of results from subj01 of the CC2017 dataset. Left: results of a shuffle test; significance was determined using paired t-tests with Bonferroni correction ($P<0.05$). Right: results of an ablation study evaluating the impact of the $C_{\text{motion}}$ and MAD guidance components.
...and 7 more figures
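The inference data flow described in the Figure 3 caption can be sketched as a simple function composition. This is an illustration of the ordering of stages only. The function name `reconstruct_video`, the callable parameters, the dictionary keys, and the assumption that the MAD takes the fMRI signal together with the motion semantics are all hypothetical and not taken from the paper.

```python
def reconstruct_video(fmri, sad, mad, vae_decode, sd_generate, t2v_generate):
    """Chain the stages named in the Figure 3 caption.

    Every callable is a placeholder supplied by the caller; none of
    them corresponds to a real library API.
    """
    # Stage 1: SAD decodes fMRI into three levels of semantic embeddings.
    z = sad(fmri)  # assumed to return keys "anchor", "motion", "holistic"

    # Stage 2: MAD refines per-frame embeddings, conditioned on motion semantics.
    frame_embeddings = mad(fmri, z["motion"])

    # Stage 3: the VAE decoder turns frame embeddings into a blurry video.
    blurry_video = vae_decode(frame_embeddings)

    # Stage 4: the SD model generates an anchor frame from the frame
    # embeddings and the anchor semantics.
    anchor_frame = sd_generate(frame_embeddings, z["anchor"])

    # Stage 5: the T2V model combines the anchor frame, the blurry video,
    # and the holistic semantics into the final reconstruction.
    return t2v_generate(anchor_frame, blurry_video, z["holistic"])
```

Passing the models in as callables keeps the sketch independent of any particular diffusion or text-to-video library.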
