Autoregressive Visual Decoding from EEG Signals

Sicheng Dai, Hongwang Xiao, Shan Yu, Qiwei Ye

TL;DR

AVDE is a lightweight and efficient framework for visual decoding from EEG signals. It outperforms previous state-of-the-art methods in both image retrieval and reconstruction tasks while using only 10% of the parameters.

Abstract

Electroencephalogram (EEG) signals have become a popular medium for decoding visual information due to their cost-effectiveness and high temporal resolution. However, current approaches face significant challenges in bridging the modality gap between EEG and image data. These methods typically rely on complex adaptation processes involving multiple stages, making it hard to maintain consistency and manage compounding errors. Furthermore, the computational overhead imposed by large-scale diffusion models limits their practicality in real-world brain-computer interface (BCI) applications. In this work, we present AVDE, a lightweight and efficient framework for visual decoding from EEG signals. First, we leverage LaBraM, a pre-trained EEG model, and fine-tune it via contrastive learning to align EEG and image representations. Second, we adopt an autoregressive generative framework based on a "next-scale prediction" strategy: images are encoded into multi-scale token maps using a pre-trained VQ-VAE, and a transformer is trained to autoregressively predict finer-scale tokens starting from EEG embeddings as the coarsest representation. This design enables coherent generation while preserving a direct connection between the input EEG signals and the reconstructed images. Experiments on two datasets show that AVDE outperforms previous state-of-the-art methods in both image retrieval and reconstruction tasks, while using only 10% of the parameters. In addition, visualization of intermediate outputs shows that the generative process of AVDE reflects the hierarchical nature of human visual perception. These results highlight the potential of autoregressive models as efficient and interpretable tools for practical BCI applications.
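The contrastive alignment step mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the exact loss, so the code below assumes a symmetric InfoNCE (CLIP-style) objective over paired EEG and image embeddings. The function names and the temperature of 0.07 are illustrative assumptions, not values from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(eeg_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; row i of each input is one (EEG, image) pair.

    Assumption: this loss form and the temperature are hypothetical choices,
    used only to illustrate contrastive alignment.
    """
    eeg = [l2_normalize(v) for v in eeg_emb]
    img = [l2_normalize(v) for v in img_emb]
    n = len(eeg)
    # logits[i][j]: temperature-scaled cosine similarity of EEG i and image j
    logits = [[sum(a * b for a, b in zip(e, m)) / temperature for m in img]
              for e in eeg]
    logits_t = [list(col) for col in zip(*logits)]
    # Matching pairs sit on the diagonal in both retrieval directions
    eeg_to_img = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    img_to_eeg = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (eeg_to_img + img_to_eeg)

# Toy batch: correctly paired embeddings vs. the same images in swapped order
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
```

In this toy batch the loss is close to zero when each EEG embedding points in the same direction as its paired image embedding, and large when the pairing is swapped. Minimizing such a loss pulls matching pairs together and pushes mismatched pairs apart.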

Paper Structure (32 sections, 8 equations, 11 figures, 14 tables)

Figures (11)

  • Figure 1: A typical unCLIP framework employed in previous EEG-based visual decoding works (li2024visual; zhang2025cognitioncapturer; xiao2025eeg; scotti2023reconstructing). Despite its flexibility, the framework comprises multiple stages (five in this case), each introducing potential sources of error that can accumulate and degrade overall performance. Furthermore, the computational and memory demands of its components present significant challenges for practical implementation in BCIs.
  • Figure 2: AVDE involves two training stages. Stage 1: A pre-trained EEG encoder is fine-tuned using contrastive learning to more effectively capture visual information embedded in EEG signals. This adaptation aims to provide a more informative initialization for the subsequent visual reconstruction process. Stage 2: A visual autoregressive transformer is trained using the next-scale prediction objective (Equation \ref{eq:next_scale_predict}). Specifically, the model takes the sequence $([s], R_1, R_2, \dots, R_{K-1})$ as input and predicts the corresponding sequence $(R_1, R_2, R_3, \dots, R_K)$. Training is guided by a standard cross-entropy loss.
  • Figure 3: Qualitative Comparison of Visual Reconstruction Performance. Selected reconstruction results from subject-08 demonstrate that the visual stimuli reconstructed by our method preserve finer-grained features, suggesting improved fidelity and detail compared to alternative approaches.
  • Figure 4: Intermediate reconstructions generated by AVDE across 10 progressive scales. Each row corresponds to a distinct EEG-evoked reconstruction instance, and each column represents the cumulative output up to a given scale. This process reflects the hierarchical nature of human visual perception, drawing parallels to the function of successive cortical visual areas (e.g., V1, V2/V4, and IT).
  • Figure 5: Analysis of similarities between intermediate scales and brain regions. (a) The mean channel embeddings from five brain regions are compared with the intermediate image embeddings. Cosine similarity is used as the measure. (b) Since the generative process is cumulative, the similarities generally increase as more scales are involved. (c) Stepwise increase captures the incremental information contributed by each scale. The step increase for occipital regions peaks at early scales and gradually diminishes thereafter. The temporal and parietal regions exhibit relatively sustained step increases across early and middle scales, followed by a decline in later scales. The frontal and central regions show low step increases initially, which progressively rise and peak at late scales.
  • ...and 6 more figures
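
The Stage 2 objective in the Figure 2 caption is a teacher-forcing setup shifted by one scale: the input $([s], R_1, \dots, R_{K-1})$ is paired with the target $(R_1, \dots, R_K)$ and trained with cross-entropy. The sketch below shows only that pairing and the loss on toy data. The helper names, the flat-list token maps, and the `"<eeg>"` placeholder for the start token $[s]$ are hypothetical, and the transformer itself (including how coarser maps are resized before being fed back in) is omitted.

```python
import math

def next_scale_pairs(start_token, token_maps):
    """Pair each conditioning block with the token map it should predict.

    token_maps is [R_1, ..., R_K], coarsest first; each R_k is a flat list
    of codebook indices. Inputs are ([s], R_1, ..., R_{K-1}) and targets are
    (R_1, ..., R_K), i.e. the same sequence shifted by one scale.
    """
    inputs = [start_token] + token_maps[:-1]
    return list(zip(inputs, token_maps))

def cross_entropy(logits, targets):
    """Mean cross-entropy of integer targets under per-token logits."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[t]
    return total / len(targets)

# Toy example (assumed sizes): K = 3 scales holding 1, 4 and 9 tokens
eeg_start = ["<eeg>"]  # placeholder for the EEG-derived start token [s]
maps = [[5], [1, 2, 3, 4], [7, 7, 8, 8, 9, 9, 0, 0, 1]]
pairs = next_scale_pairs(eeg_start, maps)

# A model that outputs uniform logits over a 4-entry codebook scores log(4)
uniform_loss = cross_entropy([[0.0] * 4, [0.0] * 4], [1, 3])
```

Each element of `pairs` is one prediction step: the first step predicts $R_1$ from the start token alone, and every later step predicts a finer map from the map one scale coarser.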