Connecting Dreams with Visual Brainstorming Instruction

Yasheng Sun, Bohan Li, Mingchen Zhuge, Deng-Ping Fan, Salman Khan, Fahad Shahbaz Khan, Hideki Koike

TL;DR

This work aims to enable interactive control of brain-derived visual content by translating fMRI signals into imagery that can be steered with natural language. It introduces DreamConnect, a dual-stream diffusion framework with an adaptor, an asynchronous diffusion strategy, and LLM-guided region-aware manipulation to map brain activity to intention-driven edits. The approach leverages Stable Diffusion-like backbones and CLIP-based embeddings, trained in a two-stage process on the NSD dataset, and demonstrates competitive reconstruction performance and superior instruction-following in qualitative and quantitative tests, with comprehensive ablations. The work highlights potential for multimodal brain-computer interfaces and acknowledges ethical considerations and limitations, pointing to future work on internal dreams, small-object edits, and multi-turn interactions.

Abstract

Recent breakthroughs in understanding the human brain have revealed its impressive ability to efficiently process and interpret human thoughts, opening up possibilities for intervening in brain signals. In this paper, we aim to develop a straightforward framework that uses other modalities, such as natural language, to translate the original dreamland. We present DreamConnect, which employs a dual-stream diffusion framework to manipulate visually stimulated brain signals. By integrating an asynchronous diffusion strategy, our framework establishes an effective interface with human dreams, progressively refining their final imagery synthesis. Through extensive experiments, we demonstrate the method's ability to accurately instruct human brain signals with high fidelity. Our project will be publicly available at https://github.com/Sys-Nexus/DreamConnect

Paper Structure (40 sections, 10 equations, 12 figures, 4 tables)

This paper contains 40 sections, 10 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Can dreams be connected and actively influenced in future applications? As shown, DreamConnect precisely performs the desired operation on the visual content. For example, if someone imagines a lake view (first row) and another person wants to change it to a sunset scene (second row), our system faithfully generates the desired sunset ambiance (third row).
  • Figure 2: Illustration of the proposed DreamConnect framework. A person's biological signal, recorded as fMRI sequences $X$, is activated in the brain according to their "dreams", represented by the visual stimuli $Y$. Our system aims to interface with $X$ via a natural language instruction $I$. Specifically, the fMRI signal is regressed to CLIP text and visual embeddings, $f_{\text{CLIP}}^{text}$ and $f_{\text{CLIP}}^{vis}$, which are aligned with the visual content $z^r_t$ in VersNet xu2023versatile. After modulation by an adaptor, the intermediate spatial features are fed to InstNet geng2023instructdiffusion. The human instruction $I$, encoded by the CLIP text encoder, is injected into InstNet to modulate these features toward the intended direction.
  • Figure 3: Illustration of the Dataset Pipeline. Given the paired fMRI signal $X$ and its corresponding visual stimuli $Y$ in NSD allen2022massive, we first query LLMs with the image caption to obtain a possible instruction $I$. The instruction, together with the visual stimuli, is then used to prompt a visual instruction model for the manipulated image $Y^{edit}$. Finally, the triplets $(X, I, Y^{edit})$ are constructed (red boxes).
  • Figure 4: Qualitative Comparison. The first column lists the visual stimulus of the fMRI signal, while the last column shows the input images used by other approaches. InstP2P brooks2023instructpix2pix struggles to capture the intent of the instruction (see last 4 rows). InstDiff geng2023instructdiffusion and MagicBrush zhang2024magicbrush tend to over-edit irrelevant regions (see toilet). SDEdit meng2021sdedit suffers from inferior image quality. Our method balances content preservation and instruction conformity well.
  • Figure 5: Ablation Study. Without feature injection from the adaptor, the model struggles to balance content preservation and instruction conformity. Removing the asynchronous strategy makes it harder to conform to the instruction. By incorporating LLM-guided regions, our framework is able to operate precisely on the relevant spatial locations (see the snow background and table region).
  • ...and 7 more figures
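
The dataset pipeline described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the names `Triplet`, `build_triplets`, `propose_instruction`, and `edit_image` are hypothetical, and the LLM and the visual instruction model are passed in as plain callables because the captions do not specify which models or APIs are used.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class Triplet:
    """One example (X, I, Y^edit), following the notation of the Figure 3 caption."""
    fmri: Any          # X: fMRI signal
    instruction: str   # I: instruction proposed from the image caption
    edited_image: Any  # Y^edit: manipulated image


def build_triplets(
    samples: Iterable[Tuple[Any, Any, str]],
    propose_instruction: Callable[[str], str],
    edit_image: Callable[[Any, str], Any],
) -> List[Triplet]:
    """Build (X, I, Y^edit) triplets from (fMRI, image, caption) samples.

    propose_instruction: hypothetical stand-in for the LLM query (caption -> I).
    edit_image: hypothetical stand-in for the visual instruction model
                ((Y, I) -> Y^edit).
    """
    triplets = []
    for fmri, image, caption in samples:
        # Step 1: query the LLM with the caption to get a possible instruction.
        instruction = propose_instruction(caption)
        # Step 2: prompt the visual instruction model with the image and instruction.
        edited = edit_image(image, instruction)
        # Step 3: pair the original fMRI signal with the instruction and edited image.
        triplets.append(Triplet(fmri, instruction, edited))
    return triplets
```

Passing the two models in as callables keeps the sketch independent of any particular LLM or editing backend, so the same loop can be exercised with simple stubs before real models are plugged in.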