MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation

Chenglong Ma, Yuanfeng Ji, Jin Ye, Zilong Li, Chenhui Wang, Junzhi Ning, Wei Li, Lihao Liu, Qiushan Guo, Tianbin Li, Junjun He, Hongming Shan

TL;DR

MedITok is a unified visual tokenizer for medical images that encodes both fine-grained visual structure and rich clinical semantics in a single latent space. It is trained with a two-stage, curriculum-inspired framework: visual representation alignment on large-scale medical images, followed by textual semantic alignment on image-caption pairs. The result is a single token space suitable for autoregressive medical models. Across more than 30 datasets, 9 imaging modalities, and 4 tasks (reconstruction, classification, modality-conditioned synthesis, and VQA), MedITok achieves state-of-the-art performance, with gains in both reconstruction fidelity and downstream task performance. The work highlights the potential of domain-specific unified tokenizers for medical multimodal AI; the authors state that the model and code will be released publicly.

Abstract

Advanced autoregressive models have reshaped multimodal AI. However, their transformative potential in medical imaging remains largely untapped due to the absence of a unified visual tokenizer: one capable of capturing fine-grained visual structures for faithful image reconstruction and realistic image synthesis, as well as rich semantics for accurate diagnosis and image interpretation. To this end, we present MedITok, the first unified tokenizer tailored for medical images, encoding both low-level structural details and high-level clinical semantics within a unified latent space. To balance these competing objectives, we introduce a novel two-stage training framework: a visual representation alignment stage that cold-starts the tokenizer's reconstruction learning with a visual semantic constraint, followed by a textual semantic representation alignment stage that infuses detailed clinical semantics into the latent space. Trained on a meticulously collected large-scale dataset of over 30 million medical images and 2 million image-caption pairs, MedITok achieves state-of-the-art performance on more than 30 datasets across 9 imaging modalities and 4 different tasks. By providing a unified token space for autoregressive modeling, MedITok supports a wide range of tasks in clinical diagnostics and generative healthcare applications. Model and code will be made publicly available at: https://github.com/Masaaki-75/meditok.

Paper Structure

This paper contains 45 sections, 4 equations, 12 figures, 13 tables.

Figures (12)

  • Figure 1: Performance comparison of different tokenizers on medical image reconstruction (rFID) and classification (mAP). MedITok achieves the best of both worlds.
  • Figure 2: Overview of the proposed training framework. (a) Architecture of MedITok. (b) Two-stage training: visual representation alignment with pretrained visual semantics, followed by textual semantic alignment using clinical captions. (c) Statistics across modalities for our training data.
  • Figure 3: Reconstruction results across multiple imaging modalities. Each reconstructed image is paired with an absolute error map against the input image with PSNR/SSIM values.
  • Figure 4: Modality-conditioned synthesized image examples produced by our LlamaGen-MedITok.
  • Figure S1: Overview of the training data for MedITok. Left: exemplar images used in the first training stage. Right: word cloud generated from the captions used in the second training stage.
  • ...and 7 more figures
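
Figure 3 pairs each reconstruction with an absolute error map and PSNR/SSIM values. As a minimal sketch of how the error map and PSNR are commonly computed (this is not the authors' evaluation code), the snippet below uses plain Python lists as 2D grayscale images. The function names, the default `data_range` of 255, and the toy pixel values are all assumptions made for illustration; SSIM is left out because it needs windowed local statistics.

```python
import math


def abs_error_map(reference, reconstruction):
    """Per-pixel absolute difference between two equally sized 2D images."""
    return [
        [abs(r - x) for r, x in zip(ref_row, rec_row)]
        for ref_row, rec_row in zip(reference, reconstruction)
    ]


def psnr(reference, reconstruction, data_range=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    errors = abs_error_map(reference, reconstruction)
    n_pixels = sum(len(row) for row in errors)
    mse = sum(e * e for row in errors for e in row) / n_pixels
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


# Toy 2x2 grayscale example (hypothetical values, not taken from the paper).
reference = [[0, 255], [128, 64]]
reconstruction = [[0, 250], [130, 64]]

error_map = abs_error_map(reference, reconstruction)
score = psnr(reference, reconstruction)
print(error_map, round(score, 2))
```

Higher PSNR means the reconstruction is closer to the input. The value depends on `data_range`, so images normalized to [0, 1] should pass `data_range=1.0` rather than the 8-bit default assumed here.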