CAVM: Conditional Autoregressive Vision Model for Contrast-Enhanced Brain Tumor MRI Synthesis
Lujun Gui, Chuyang Ye, Tianyi Yan
TL;DR
CAVM addresses the need to synthesize contrast-enhanced brain MRI without gadolinium by recasting the problem as progressive dose escalation. It introduces a decomposition tokenizer and a dose-variant autoregression built on LLaMA-style Transformers with a staircase self-attention mask, coupled with a Swin UNETR–based decoder, to generate $y_{LD}$, $y_{HD}$, and $y_{SD}$ from non-contrast inputs. Training combines autoencoding and image-to-image tasks, followed by autoregression optimization, yielding superior tumor-region synthesis and improved downstream segmentation on BraSyn-2023 compared with state-of-the-art baselines. The approach demonstrates strong potential for safer, dose-guided contrast synthesis with practical clinical impact and sets the stage for exploring output-state sampling in future work.
Abstract
Contrast-enhanced magnetic resonance imaging (MRI) is pivotal in the pipeline of brain tumor segmentation and analysis. Gadolinium-based contrast agents, the most commonly used contrast agents, are expensive and may have side effects, so it is desirable to obtain contrast-enhanced brain tumor MRI scans without actually administering contrast agents. Deep learning methods have been applied to synthesize virtual contrast-enhanced MRI scans from non-contrast images. However, because this synthesis problem is inherently ill-posed, these methods fall short of producing high-quality results. In this work, we propose the Conditional Autoregressive Vision Model (CAVM) to improve the synthesis of contrast-enhanced brain tumor MRI. Because image enhancement grows with the dose of contrast agent, we assume that a lower-dose virtual image, which differs less from the non-contrast image, is less challenging to synthesize. Thus, CAVM gradually increases the contrast agent dosage and produces higher-dose images based on previous lower-dose ones until the final desired dose is achieved. Inspired by the resemblance between the gradual dose increase and the Chain-of-Thought approach in natural language processing, CAVM uses an autoregressive strategy with a decomposition tokenizer and a decoder. Specifically, the tokenizer produces a more compact image representation for computational efficiency and decomposes the image into dose-variant and dose-invariant tokens. Then, a masked self-attention mechanism is developed for autoregression that gradually increases the dose of the virtual image based on the dose-variant tokens. Finally, the updated dose-variant tokens corresponding to the desired dose are decoded together with the dose-invariant tokens to produce the final contrast-enhanced MRI.
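To make the masked self-attention idea concrete, the following is a minimal sketch of one plausible reading of a "staircase" mask: a block-causal pattern in which the tokens of each dose stage attend to their own stage and to all earlier stages. This layout is an assumption for illustration and is not confirmed by the text above; the names `staircase_mask`, `render`, `num_stages`, and `tokens_per_stage` are hypothetical.

```python
from typing import List


def staircase_mask(num_stages: int, tokens_per_stage: int) -> List[List[bool]]:
    """Build a block-causal ("staircase") attention mask.

    mask[q][k] is True when query token q may attend to key token k.
    Assumption: tokens are laid out stage by stage (one stage per dose level),
    and a token sees its own stage plus every earlier stage.
    """
    n = num_stages * tokens_per_stage
    mask = []
    for q in range(n):
        q_stage = q // tokens_per_stage
        # A key is visible if it belongs to the same or an earlier stage.
        mask.append([(k // tokens_per_stage) <= q_stage for k in range(n)])
    return mask


def render(mask: List[List[bool]]) -> str:
    """Draw the mask as text: '#' means attention allowed, '.' means blocked."""
    return "\n".join(
        "".join("#" if allowed else "." for allowed in row) for row in mask
    )


if __name__ == "__main__":
    # Three hypothetical dose stages with two tokens each.
    print(render(staircase_mask(num_stages=3, tokens_per_stage=2)))
```

Printed for three stages of two tokens each, the allowed region grows by one block per stage, which gives the staircase shape. Unlike a standard token-level causal mask, tokens within the same stage can see each other under this assumption.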
