Neural-Driven Image Editing

Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Hao Jin, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You

TL;DR

LoongX is a hands-free image editing framework driven by multimodal neural signals (EEG, fNIRS, PPG, and head motion) and built on a diffusion backbone. Its two core components are the Cross-Scale State Space (CS3) encoder, which extracts modality-specific features, and Dynamic Gated Fusion (DGF), which aggregates those features into a unified latent space. The fused representation is aligned with edit semantics by fine-tuning a Diffusion Transformer (DiT), and the neural encoders are pretrained with contrastive learning. On the L-Mind dataset (23,928 image-editing pairs from 12 participants), LoongX achieves performance comparable to text-driven baselines (CLIP-I: $0.6605$ vs. $0.6558$; DINO: $0.4812$ vs. $0.4636$) and outperforms them when neural signals are combined with speech (CLIP-T: $0.2588$ vs. $0.2549$). These results support the viability of neural-driven image editing for accessibility and point toward cognitive-driven creative AI, with potential extensions to VR/AR environments and real-world deployment.

Abstract

Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. The code and dataset are released on the project website: https://loongx1.github.io.
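The abstract names two mechanisms without giving their details here: gated fusion of per-modality features and contrastive alignment of neural embeddings with language embeddings. The sketch below illustrates the general idea of each in plain Python, under stated assumptions. It is not the authors' CS3 or DGF implementation: the softmax gate, the `gate_weights` scorer, the function names, and the temperature value are all hypothetical choices, and the contrastive objective shown is a generic symmetric InfoNCE loss.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def log_softmax(xs):
    """Numerically stable log-softmax over a list of scores."""
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_total for x in xs]


def gated_fusion(features, gate_weights):
    """Fuse per-modality vectors with input-dependent softmax gates.

    features:     dict modality -> feature vector (all of length d)
    gate_weights: dict modality -> scoring vector of length d
                  (hypothetical stand-in for a learned gating network)
    Returns the fused vector and the gate assigned to each modality.
    """
    names = sorted(features)
    scores = [dot(gate_weights[n], features[n]) for n in names]
    gates = [math.exp(s) for s in log_softmax(scores)]
    d = len(features[names[0]])
    fused = [sum(g * features[n][i] for g, n in zip(gates, names))
             for i in range(d)]
    return fused, dict(zip(names, gates))


def info_nce(neural_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: row i of each list is a matching pair."""
    zn = [l2_normalize(v) for v in neural_embs]
    zt = [l2_normalize(v) for v in text_embs]
    n = len(zn)
    # Cosine similarity of every neural/text pair, scaled by the temperature.
    logits = [[dot(a, b) / temperature for b in zt] for a in zn]
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    neural_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_neural = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (neural_to_text + text_to_neural)


if __name__ == "__main__":
    # Toy 2-dimensional features for two of the four modalities.
    feats = {"eeg": [2.0, 0.0], "ppg": [0.0, 4.0]}
    weights = {"eeg": [0.0, 0.0], "ppg": [0.0, 0.0]}  # equal scores
    fused, gates = gated_fusion(feats, weights)
    print("gates:", gates)
    print("fused:", fused)

    pairs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    shuffled = pairs[1:] + pairs[:1]
    print("aligned loss:", info_nce(pairs, pairs))
    print("misaligned loss:", info_nce(pairs, shuffled))
```

In this toy setup, equal gate scores reduce the fusion to a plain average, and the contrastive loss is much lower when each neural embedding is paired with its own text embedding than when the pairs are shuffled. A trained system would learn both the gate scorer and the encoders that produce the embeddings.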

Paper Structure

This paper contains 54 sections, 20 equations, 14 figures, and 7 tables.

Figures (14)

  • Figure 1: Illustration of LoongX for hands-free image editing via multimodal neural signals.
  • Figure 2: The L-Mind dataset comprises 23,928 multimodal editing samples, each including an original image, a ground truth text editing instruction, a ground truth edited image, as well as measured EEG, fNIRS, PPG, motion and speech signals. (a) Multimodal data collection pipeline; (b) Illustration and statistics of 35 types of image editing tasks.
  • Figure 3: Overview of our proposed LoongX method for hands-free image editing. Receiving an input image, LoongX outputs an edited image using neural signals (and optional speech) as conditions.
  • Figure 4: Evaluation of different signal combinations on the proposed DGF module.
  • Figure 5: Evaluation results on different brain region signals where LoongX is trained and tested on each respective EEG channel.
  • ...and 9 more figures