U-VLM: Hierarchical Vision Language Modeling for Report Generation

Pengcheng Shi, Minghui Zhang, Kehan Song, Jiaqi Liu, Yun Gu, Xinglin Zhang

TL;DR

U-VLM applies hierarchical vision-language modeling in both training and architecture: progressive training from segmentation to classification to report generation, and multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. It achieves state-of-the-art performance on CT-RATE and AbdomenAtlas 3.0 using only a 0.1B decoder trained from scratch.

Abstract

Automated radiology report generation is key to reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale information. We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture: (1) progressive training from segmentation to classification to report generation, and (2) multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each training stage can leverage different datasets without unified annotations. U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and AbdomenAtlas 3.0 (F1: 0.624 vs 0.518 for segmentation-based detection) using only a 0.1B decoder trained from scratch, demonstrating that well-designed vision encoder pretraining outweighs the benefits of 7B+ pre-trained language models. Ablation studies show that progressive pretraining significantly improves F1, while multi-layer injection improves BLEU-mean. Code is available at https://github.com/yinghemedical/U-VLM.

Paper Structure (9 sections, 5 equations, 2 figures, 5 tables)

This paper contains 9 sections, 5 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: U-VLM framework. Stage 1: Segmentation pretraining for learning fine-grained spatial structures. Stage 2: Classification pretraining for disease pattern recognition. Stage 3: Report generation via multi-layer injection (deep encoder $\rightarrow$ early language layers, shallow encoder $\rightarrow$ later layers).
  • Figure 2: Qualitative results of segmentation and report generation on chest CT (CT-RATE) and abdominal CT (AbdomenAtlas 3.0). We visualize the input 3D CT volumes alongside segmentation predictions: Seg(F+L) for chest CT and Seg(C+L) for abdominal CT. For all reports, text is color-coded to highlight abnormalities, maintaining consistent colors for the same pathology, while normal descriptions are shown in black.
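
The routing direction described in the Figure 1 caption (deep encoder features to early language layers, shallow encoder features to later layers) can be illustrated with a small sketch. This is not the authors' implementation: the function name `route_stages_to_layers`, the even split of decoder layers into contiguous groups, and the example sizes (4 encoder stages, 12 decoder layers) are all assumptions made for illustration. Only the deep-to-early, shallow-to-late ordering comes from the caption.

```python
def route_stages_to_layers(num_encoder_stages: int, num_decoder_layers: int) -> dict[int, int]:
    """Map each decoder layer index to the encoder stage feeding it.

    Hypothetical helper, not taken from the U-VLM code base.
    Stage 0 is the shallowest (highest-resolution) encoder stage;
    stage num_encoder_stages - 1 is the deepest.
    """
    if num_encoder_stages < 1 or num_decoder_layers < num_encoder_stages:
        raise ValueError("need at least one decoder layer per encoder stage")

    routing = {}
    for layer in range(num_decoder_layers):
        # Assumption: decoder layers are split into contiguous,
        # near-equal groups, one group per encoder stage (0 = earliest group).
        group = layer * num_encoder_stages // num_decoder_layers
        # Reverse the order so the earliest group receives the deepest stage.
        routing[layer] = num_encoder_stages - 1 - group
    return routing


# Illustrative sizes only; the paper's actual stage and layer counts may differ.
routing = route_stages_to_layers(num_encoder_stages=4, num_decoder_layers=12)
for layer, stage in routing.items():
    print(f"decoder layer {layer:2d} <- encoder stage {stage}")
```

With these example sizes, the first three decoder layers receive the deepest stage and the last three receive the shallowest. How the routed features are actually fused into each language layer is not specified in this excerpt, so the sketch stops at the index mapping.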