Brain3D: Brain Report Automation via Inflated Vision Transformers in 3D

Mariano Barone, Francesco Di Serio, Giuseppe Riccio, Antonio Romano, Marco Postiglione, Antonino Ferraro, Vincenzo Moscato

TL;DR

A staged vision-language framework for automated radiology report generation from 3D brain tumor MRI that inflates a pretrained 2D medical encoder into a native 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization.

Abstract

Current medical vision-language models (VLMs) process volumetric brain MRI using 2D slice-based approximations, fragmenting the spatial context required for accurate neuroradiological interpretation. We developed Brain3D, a staged vision-language framework for automated radiology report generation from 3D brain tumor MRI. Our approach inflates a pretrained 2D medical encoder into a native 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Unlike generalist 3D medical VLMs, Brain3D is tailored to neuroradiology, where hemispheric laterality, tumor infiltration patterns, and anatomical localization are critical. Evaluated on 468 subjects (BraTS pathological cases plus healthy controls), our model achieves a Clinical Pathology F1 of 0.951 versus 0.413 for a strong 2D baseline while maintaining perfect specificity on healthy scans. The staged alignment proves essential: contrastive grounding establishes visual-textual correspondence, projector warmup stabilizes conditioning, and LoRA adaptation shifts output from verbose captions to structured clinical reports. Our code is publicly available for transparency and reproducibility.
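
The abstract does not spell out how the 2D encoder is inflated. As a rough illustration only, the sketch below assumes the common weight-inflation recipe in which a 2D kernel is replicated along the new depth axis and divided by the depth, so that a volume that is constant along depth produces the same response as the original 2D kernel on a single slice. The function names and the toy kernel are hypothetical and are not taken from the Brain3D code.

```python
def inflate_kernel_2d_to_3d(kernel_2d, depth):
    """Replicate a 2D kernel along a new depth axis and rescale by 1/depth.

    Assumption: this is the generic inflation recipe, not necessarily
    the exact procedure used by Brain3D.
    """
    return [[[w / depth for w in row] for row in kernel_2d]
            for _ in range(depth)]


def apply_kernel_2d(kernel_2d, patch_2d):
    """Dot product between a 2D kernel and a 2D patch of the same shape."""
    return sum(k * p
               for k_row, p_row in zip(kernel_2d, patch_2d)
               for k, p in zip(k_row, p_row))


def apply_kernel_3d(kernel_3d, patch_3d):
    """Dot product between a 3D kernel and a 3D patch, slice by slice."""
    return sum(apply_kernel_2d(k, p) for k, p in zip(kernel_3d, patch_3d))


# Toy check: a volume built by stacking one slice 4 times should give
# the same response under the inflated kernel as the slice gives in 2D.
kernel = [[1.0, -2.0], [0.5, 3.0]]
slice_2d = [[2.0, 1.0], [4.0, 0.0]]
volume = [slice_2d] * 4

response_2d = apply_kernel_2d(kernel, slice_2d)
response_3d = apply_kernel_3d(inflate_kernel_2d_to_3d(kernel, 4), volume)
```

Dividing by the depth keeps the pretrained 2D behaviour as the starting point for 3D training; other initialisations (for example, placing the 2D weights in the central slice only) are also possible.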

Paper Structure

This paper contains 24 sections, 7 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Brain3D Architecture. A standardized MRI volume $X$ is processed by an inflated 3D Transformer encoder, producing $N$ volumetric patch tokens $Z_{\text{enc}}$. These tokens are compressed via Vision Token Compression into a fixed set of $K=32$ tokens $Z_{\text{cmp}}$. The compressed visual tokens are projected into the language embedding space ($Z_{\text{proj}}$) and scaled to obtain conditioning tokens $Z_{\text{cond}}$, which are prepended to the textual embeddings and used to guide autoregressive report generation by the causal LLM.
  • Figure 2: Staged Training Strategy. The framework employs a progressive three-phase alignment pipeline. Phase 1: Contrastive Image-Text Grounding aligns the 3D representations ($Z_{vis}$) with report semantics ($Z_t$) using a symmetric InfoNCE loss. Phase 2A: Projector Warmup performs supervised generation with a frozen LLM to stabilize the visual-language mapping. Phase 2B: Linguistic Adaptation fine-tunes the projector and LoRA adapters jointly to capture neuroradiology syntax. Legend: Modules marked with an ice icon are frozen; modules marked with a fire icon are trainable.
  • Figure 3: 3D LIME Attribution Maps. Volumetric grounding visualized via 3D LIME over SLIC supervoxels for a representative test case. Red regions indicate positive attribution (supporting the report); blue regions indicate negative attribution. The tumor-bearing hemisphere is correctly highlighted; however, diffuse and partially contralateral supervoxels are also activated, suggesting reliance on both lesion-centered and global contextual patterns, potentially contributing to lateralization errors.
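
The Figure 2 caption names a symmetric InfoNCE loss for Phase 1. A minimal sketch of that loss in its standard form is given below, assuming L2-normalised embeddings, cosine-similarity logits, and matched image-report pairs on the diagonal of the batch. The temperature value of 0.07 and all function names are assumptions for illustration, not values or identifiers reported by the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(z_vis, z_txt, temperature=0.07):
    """Average of image-to-text and text-to-image InfoNCE terms.

    z_vis[i] and z_txt[i] are assumed to be a matched pair.
    temperature=0.07 is an illustrative default, not from the paper.
    """
    v = [_normalize(x) for x in z_vis]
    t = [_normalize(x) for x in z_txt]
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


# Toy batch of two pairs: aligned pairs should score a much lower loss
# than the same embeddings with the report order swapped.
vis = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = symmetric_info_nce(vis, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = symmetric_info_nce(vis, [[0.0, 1.0], [1.0, 0.0]])
```

The symmetric form penalises both retrieval directions, so each volume must rank its own report first and each report must rank its own volume first.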