Improving Brain-to-Image Reconstruction via Fine-Grained Text Bridging

Runze Xia, Shuo Feng, Renzhi Wang, Congchi Yin, Xuyun Wen, Piji Li

TL;DR

This work tackles missing details in brain-to-image reconstruction by enriching the semantic targets. It introduces Fine-grained Brain-to-Image reconstruction (FgB2I), a three-stage framework. First, large vision-language models (LVLMs) enrich the image captions. Second, a unified brain-to-text model, trained with reinforced co-training under three reward metrics, decodes fine-grained text from fMRI. Third, the decoded text is bridged to diffusion-based image reconstruction by fusing its semantics with existing high-level representations. The results indicate that fine-grained textual descriptions can improve semantic fidelity across multiple reconstruction pipelines (LDM, BrainDiffuser, MindEye), with notable gains for text-driven control and qualitative improvements in cases where the original captions missed key objects or relations. The findings suggest that LVLMs and reinforcement-guided text synthesis can improve brain decoding accuracy and semantic reconstruction, while hallucinations and fMRI signal limitations remain open challenges for future work.

Abstract

Brain-to-Image reconstruction aims to recover visual stimuli perceived by humans from brain activity. However, the reconstructed visual stimuli often lack details and show semantic inconsistencies, which may be attributed to insufficient semantic information. To address this issue, we propose an approach named Fine-grained Brain-to-Image reconstruction (FgB2I), which employs fine-grained text as a bridge to improve image reconstruction. FgB2I comprises three key stages: detail enhancement, decoding fine-grained text descriptions, and text-bridged brain-to-image reconstruction. In the detail-enhancement stage, we leverage large vision-language models to generate fine-grained captions for visual stimuli and experimentally validate the importance of this enhancement. We propose three reward metrics (object accuracy, text-image semantic similarity, and image-image semantic similarity) to guide the language model in decoding fine-grained text descriptions from fMRI signals. The fine-grained text descriptions can be integrated into existing reconstruction methods to achieve fine-grained Brain-to-Image reconstruction.
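
The three reward metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact definitions, so the sketch assumes that object accuracy is the fraction of reference objects recovered in the decoded text, that both semantic similarities are cosine similarities between precomputed embedding vectors, and that the three rewards are combined by a weighted sum. The function names and the `weights` parameter are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for degenerate vectors
    return dot / (norm_u * norm_v)


def object_accuracy(decoded_objects, reference_objects):
    """Assumed definition: share of reference objects found in the decoded text."""
    reference = set(reference_objects)
    if not reference:
        return 0.0
    return len(reference & set(decoded_objects)) / len(reference)


def combined_reward(decoded_objects, reference_objects,
                    text_emb, stimulus_emb, recon_emb,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three rewards (the weighting scheme is an assumption)."""
    r_obj = object_accuracy(decoded_objects, reference_objects)
    r_text_image = cosine_similarity(text_emb, stimulus_emb)    # decoded text vs. stimulus
    r_image_image = cosine_similarity(recon_emb, stimulus_emb)  # reconstruction vs. stimulus
    w_obj, w_ti, w_ii = weights
    return w_obj * r_obj + w_ti * r_text_image + w_ii * r_image_image
```

In practice the embeddings would come from a vision-language encoder rather than hand-written lists; the sketch only shows how the three signals could be scored and combined into one scalar reward.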

Paper Structure

This paper contains 21 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The top section illustrates the cognitive assumptions made during scene observation. The bottom section compares the granularity of the detail-enhanced caption with that of the original caption.
  • Figure 2: Overview of FgB2I. (a) Detail enhancement of visual stimuli captions through LVLMs. (b) The process of decoding fine-grained text descriptions from brain signals, with the details inside the blue box further illustrated in Figure \ref{fig:semantic}. (c) The workflow for combining fine-grained text descriptions with existing methods, including semantic fusion through weighted average of text semantic embedding and the integration of text and image embeddings.
  • Figure 3: Diagram of the brain-to-text model structure and training. The flame represents the trainable components.
  • Figure 4: Diagrams of the three reward function calculations. (Left) Reward for evaluating the accuracy of decoded objects. (Middle) Semantic similarity between decoded text and visual stimuli, where $C_i$ and $C_t$ denote the CLIP embeddings of the image and the text, respectively. (Right) Semantic similarity between the reconstructed image and visual stimuli.
  • Figure 5: A comparison of the reconstructed image results between existing methods (LDM [takagi2023high], BrainDiffuser [ozcelik2303brain], and MindEye [scotti2023reconstructing]) and the results obtained when these methods are combined with our fine-grained text descriptions. GT denotes the corresponding ground truth stimulus image.
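
The semantic fusion mentioned in the Figure 2 caption, a weighted average of text semantic embeddings, can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: it assumes both embeddings are same-length vectors and that a single scalar weight, here the hypothetical parameter `alpha`, balances the decoded-text embedding against the embedding an existing pipeline already predicts.

```python
def fuse_embeddings(text_emb, existing_emb, alpha=0.5):
    """Weighted average of two embeddings.

    text_emb: embedding of the decoded fine-grained text (assumed input).
    existing_emb: semantic embedding from an existing pipeline (assumed input).
    alpha: hypothetical weight on the decoded-text embedding, in [0, 1].
    """
    if len(text_emb) != len(existing_emb):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * t + (1.0 - alpha) * e
            for t, e in zip(text_emb, existing_emb)]


# Usage: lean mostly on the existing embedding, with a quarter weight on the text.
fused = fuse_embeddings([1.0, 0.0], [0.0, 1.0], alpha=0.25)
```

Setting `alpha` to 0 recovers the existing pipeline unchanged, which makes the contribution of the decoded text easy to isolate in an ablation.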