Reanimating Images using Neural Representations of Dynamic Stimuli

Jacob Yeung, Andrew F. Luo, Gabriel Sarch, Margaret M. Henderson, Deva Ramanan, Michael J. Tarr

TL;DR

This work addresses how the brain represents dynamic visual motion and how such neural information can improve artificial video understanding and generation. It proposes BrainNRDS, a disentangled framework that decouples static image representations from motion representations, enabling fMRI-based decoding of optical flow and subsequent reanimation of a video’s initial frame via a motion-conditioned diffusion model. Key findings include: (1) brain activity can predict fine-grained optical flow; (2) video encoders outperform image encoders in predicting brain responses; (3) brain-decoded motion enables realistic video reanimation from a single initial frame; and (4) full video decoding from brain activity is feasible when conditioning on brain-derived motion. These results advance our understanding of neural dynamics in dynamic scenes and point toward brain-informed, biologically plausible enhancements for robust video understanding and generation systems.

Abstract

While computer vision models have made incredible strides in static image recognition, they still do not match human performance in tasks that require the understanding of complex, dynamic motion. This is notably true for real-world scenarios where embodied agents face complex and motion-rich environments. Our approach, BrainNRDS (Brain-Neural Representations of Dynamic Stimuli), leverages state-of-the-art video diffusion models to decouple static image representation from motion generation, enabling us to utilize fMRI brain activity for a deeper understanding of human responses to dynamic visual stimuli. Conversely, we also demonstrate that information about the brain's representation of motion can enhance the prediction of optical flow in artificial systems. Our novel approach leads to four main findings: (1) Visual motion, represented as fine-grained, object-level resolution optical flow, can be decoded from brain activity generated by participants viewing video stimuli; (2) Video encoders outperform image-based models in predicting video-driven brain activity; (3) Brain-decoded motion signals enable realistic video reanimation based only on the initial frame of the video; and (4) We extend prior work to achieve full video decoding from video-driven brain activity. BrainNRDS advances our understanding of how the brain represents spatial and temporal information in dynamic visual scenes. Our findings demonstrate the potential of combining brain imaging with video diffusion models for developing more robust and biologically-inspired computer vision systems. We show additional decoding and encoding examples on this site: https://brain-nrds.github.io/.

Paper Structure (16 sections, 31 figures, 3 tables)

Figures (31)

  • Figure 1: Encoding and decoding video motion using brain activity. (a) fMRI brain activity can be predicted (using a Pearson coefficient) from features of off-the-shelf video encoders (VideoMAE tong2022videomae) extracted from the viewed video. In the converse direction, we can generate video by decoding brain activity. (b) Many existing video diffusion models (e.g., SVD blattmann2023stable) generate a video by animating an initial frame. (c) This suggests that "brain-to-video" generation can be achieved by fine-tuning diffusers to condition on fMRI input (e.g., MindVideo chen2023cinematic). (d) In our approach, we explicitly decouple the task of image and motion generation from brain activity. Given an initial video frame (which could be decoded from brain activity as in (c)) and fMRI input, we train a network to predict optical flow. We then animate the initial frame by feeding the predicted flow into an off-the-shelf motion-conditioned diffusion model (e.g., DragNUWA yin2023dragnuwa). Our disentangled pipeline produces more accurate brain-conditioned motion decodings than either (b) or (c). (e) Ground truth optical flow and video.
  • Figure 2: BrainNRDS pipeline for motion decoding and video generation. (a) BrainNRDS takes in neural data and image features extracted from the initial frame by DINOv2 oquab2023dinov2 to predict consecutive future dense optical flow fields. Salient objects are masked using FlowSAM xie2024flowsam to obtain the masked object flow. Snowflakes = frozen; flames = actively trained. (b) The initial frame is realistically animated using our predicted motion and a motion-conditioned video diffusion model, DragNUWA yin2023dragnuwa.
  • Figure 3: Quantitative motion decoding. Optical flow predictors trained with neural data (Ours) are statistically better than both generative models trained without neural data (No Brain - Stable Video Diffusion (Best) blattmann2023stable) and generative models that fail to disentangle appearance and motion (MindVideo (Best) chen2023cinematic). We find that our method conditioned on the initial frame generated by MindVideo (Ours + MindVideo (Best)) better predicts the optical flow than MindVideo. We average the end point error over the predicted and ground truth masked optical flow vectors. Similar to eslami2024rethinkingraftefficientoptical, we report end point error on pixels whose ground truth flow magnitudes exceed 1% of the pixel width of the frames. Lower values are better. Paired $t$-tests comparing end point error for "Ours" versus SVD (Best) for each participant are as follows: S1: $p \le 1.197 \times 10^{-8}$; S2: $p \le 3.157 \times 10^{-8}$; S3: $p \le 9.140 \times 10^{-7}$. Error bars represent the standard error of the mean.
  • Figure 4: Motion predicted with and without neural data. Motion predicted using brain data and a static reference image is qualitatively both plausible and aligned with the original video. In contrast, motion predicted from only the reference image using Stable Video Diffusion blattmann2023stable is plausible but not aligned with the original video: with purely image features, the model is hallucinating flows. (a) An example video where there are multiple plausible actions. (b) An example video where camera motion overrides the plausible motion. (c) An example video where stationary objects are ambiguous due to camera motion.
  • Figure 5: Static image animation results. Optical flow predicted from fMRI brain activity ("Our Flow") and the ground truth initial frame (gray box) compared against the ground truth ("GT") flow. Beneath the ground truth video frames, we show our results for animating the initial frame by combining the brain-conditioned motion prediction with DragNUWA yin2023dragnuwa ("Our Video").
  • ...and 26 more figures
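
The end point error metric described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the nested-list data layout, and the choice to apply the object mask and the 1% magnitude cutoff jointly are all assumptions made for illustration. It averages the Euclidean distance between predicted and ground truth flow vectors over masked pixels whose ground truth flow magnitude exceeds a fraction of the frame width.

```python
import math


def masked_end_point_error(pred_flow, gt_flow, mask, frame_width, min_frac=0.01):
    """Mean end point error over masked pixels with non-trivial ground truth motion.

    Hypothetical helper (not from the paper's code).
    pred_flow, gt_flow: H x W nested lists of (dx, dy) displacements in pixels.
    mask: H x W nested list of booleans (True = pixel belongs to a salient object).
    frame_width: frame width in pixels; min_frac * frame_width is the cutoff.
    Returns None when no pixel passes both filters.
    """
    threshold = min_frac * frame_width
    errors = []
    for pred_row, gt_row, mask_row in zip(pred_flow, gt_flow, mask):
        for (pu, pv), (gu, gv), keep in zip(pred_row, gt_row, mask_row):
            if not keep:
                continue  # pixel lies outside the object mask
            if math.hypot(gu, gv) <= threshold:
                continue  # ground truth motion too small to evaluate
            # End point error: Euclidean distance between the two flow vectors.
            errors.append(math.hypot(pu - gu, pv - gv))
    return sum(errors) / len(errors) if errors else None


# Toy 2 x 2 example with a frame width of 100 pixels (cutoff = 1 pixel).
gt = [[(3.0, 4.0), (0.5, 0.0)], [(0.0, 2.0), (6.0, 8.0)]]
pred = [[(0.0, 0.0), (5.0, 5.0)], [(0.0, 0.0), (6.0, 8.0)]]
mask = [[True, True], [False, True]]
print(masked_end_point_error(pred, gt, mask, frame_width=100))  # 2.5
```

In the toy example, only two pixels are scored: the top-left pixel (error 5) and the bottom-right pixel (error 0). The top-right pixel is dropped because its ground truth magnitude is below the cutoff, and the bottom-left pixel is dropped because it is outside the mask.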