MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations

Ziyang Zhang, Yang Yu, Yucheng Chen, Xulei Yang, Si Yong Yeo

TL;DR

MedUnifier tackles the annotation bottleneck in medical imaging by unifying vision-language pre-training with a text-grounded image generation pathway that uses discrete visual representations. The framework combines an image-text encoder, a text generator, and a VQ-VAE–based image generator, trained with ITC, ITM, ITG, and TIG losses to align modalities and enable generation. It demonstrates state-of-the-art performance across uni-, cross-, and multi-modal medical tasks on radiology benchmarks, and shows that TIG improves visual and textual synthesis while maintaining efficient training. The approach advances clinical AI by delivering a generalizable, generation-augmented VLP model capable of producing realistic medical images and reports from text prompts, with strong potential for data augmentation and robust cross-modal reasoning.
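
As an illustration of the contrastive part of the objective, the sketch below shows a generic symmetric image-text contrastive (ITC) loss of the kind widely used in vision-language pre-training. It is not taken from the MedUnifier code: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions, and the paper's exact formulation may differ.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images and texts is the positive.

    The temperature value is an assumed hyperparameter, not one from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled cosine similarity between image i and text j.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    t2i = sum(cross_entropy_row([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: two image embeddings and their matching text embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = itc_loss(images, matched_texts)
loss_swapped = itc_loss(images, swapped_texts)
```

With correctly paired embeddings the loss is close to zero, while swapping the texts makes it large, which is the behaviour that drives paired images and reports together in the shared embedding space.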

Abstract

Despite significant progress in Vision-Language Pre-training (VLP), current approaches predominantly emphasize feature extraction and cross-modal comprehension, with limited attention to generating or transforming visual content. This gap hinders the model's ability to synthesize coherent and novel visual representations from textual prompts, thereby reducing the effectiveness of multi-modal learning. In this work, we propose MedUnifier, a unified VLP framework tailored for medical data. MedUnifier seamlessly integrates text-grounded image generation capabilities with multi-modal learning strategies, including image-text contrastive alignment, image-text matching, and image-grounded text generation. Unlike traditional methods that rely on continuous visual representations, our approach employs visual vector quantization, which not only facilitates a more cohesive learning strategy for cross-modal understanding but also enhances multi-modal generation quality by effectively leveraging discrete representations. Our framework's effectiveness is evidenced by experiments on established benchmarks, including uni-modal tasks (supervised fine-tuning), cross-modal tasks (image-text retrieval and zero-shot image classification), and multi-modal tasks (medical report generation, image synthesis), where it achieves state-of-the-art performance across various tasks. MedUnifier also offers a highly adaptable tool for a wide range of language and vision tasks in healthcare, marking an advance toward the development of a generalizable AI model for medical applications.
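
To make "visual vector quantization" concrete, the following minimal sketch shows the generic nearest-neighbour codebook lookup used in VQ-VAE-style models: each continuous feature vector is replaced by the index of its closest codebook entry, giving a discrete representation. The codebook, the feature values, and the function name `quantize` are hypothetical and chosen only for illustration; MedUnifier's actual quantizer, codebook size, and training losses are not reproduced here.

```python
def quantize(vectors, codebook):
    """Map each continuous vector to its nearest codebook entry.

    Returns the discrete token indices and the corresponding quantized vectors.
    Distance is squared Euclidean, the usual choice in VQ-VAE-style lookups.
    """
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    quantized = [list(codebook[i]) for i in indices]
    return indices, quantized

# Hypothetical 3-entry codebook and three 2-D visual features.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
features = [[0.1, -0.2], [3.5, 4.2], [0.9, 1.3]]
token_ids, quantized_features = quantize(features, codebook)
```

The resulting `token_ids` behave like a vocabulary of visual tokens, so an image can be handled as a discrete sequence, much as text is.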

Paper Structure

This paper contains 39 sections, 16 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Our MedUnifier framework incorporates learnable embeddings to enable multi-modal interactions. The red components focus on the initial extraction of visual features and the reconstruction of medical images. The green elements are dedicated to the modelling and interpretation of medical reports. Meanwhile, the blue components apply a range of attention-masking strategies to achieve a comprehensive fusion of image and text representations.
  • Figure 2: Left: model architecture consists of an image-text encoder, a text generator, and an image generator to extract the most relevant visual and textual representations by optimizing four distinctive loss functions (ITM, ITC, ITG, TIG). Right: self-attention masking strategies for different learning objectives. Bottom: detailed learning objectives. Integrating visual and textual information enables deep fusion through cross-modal interaction and allows each modality to be processed independently for uni-modal generation.
  • Figure 3: Comparison of ground truth and generated radiology reports reveals strong semantic alignment. In the top figure, both reports describe normal heart size, no pneumothorax or pleural effusion, and a normal cardiomediastinal silhouette, with the generated text adding details on osseous structures/intrathoracic processes. In the bottom figure, both reports align on pneumothorax and cardiomegaly. The same colours denote matched content between the generated sequences and the ground truth report.
  • Figure 4: Detailed structure of the latent adapters. The learned embeddings are fed into the top adapter as input, while the text representation, concatenated with the local preliminary visual embeddings, is fed into the bottom adapters.
  • Figure 5: Additional comparison between real radiographs and their reconstructions.
  • ...and 1 more figure