Abn-BLIP: Abnormality-aligned Bootstrapping Language-Image Pre-training for Pulmonary Embolism Diagnosis and Report Generation from CTPA

Zhusi Zhong, Yuli Wang, Lulu Bi, Zhuoqi Ma, Sun Ho Ahn, Christopher J. Mullin, Colin F. Greineder, Michael K. Atalay, Scott Collins, Grayson L. Baird, Cheng Ting Lin, Webster Stayman, Todd M. Kolb, Ihab Kamel, Harrison X. Bai, Zhicheng Jiao

TL;DR

This work introduces Abn-BLIP, a PE-specific vision-language framework for CTPA that aligns abnormal findings with structured radiology reports via anatomy-guided multi-abnormality identification, abnormality-driven visual querying (Abn-QFormer), and abnormality-aligned bootstrapping learning (ACL and ATG). The model achieves state-of-the-art performance on abnormality diagnosis and 3D CTPA report generation across BUH and INSPECT datasets, supported by qualitative analyses and expert evaluations. Its two-stage training, modular architecture, and region-wise reporting enable interpretable, clinically coherent outputs with competitive efficiency, suggesting strong potential for real-world radiology workflow integration. The approach highlights the value of abnormality-centric cross-modal alignment and structured reporting in enhancing diagnostic accuracy and radiology workflow efficiency.

Abstract

Medical imaging plays a pivotal role in modern healthcare, with computed tomography pulmonary angiography (CTPA) being a critical tool for diagnosing pulmonary embolism and other thoracic conditions. However, the complexity of interpreting CTPA scans and generating accurate radiology reports remains a significant challenge. This paper introduces Abn-BLIP (Abnormality-aligned Bootstrapping Language-Image Pre-training), a diagnosis model designed to align abnormal findings with their report descriptions in order to improve the accuracy and comprehensiveness of radiology reports. By leveraging learnable queries and cross-modal attention mechanisms, our model demonstrates superior performance in detecting abnormalities, reducing missed findings, and generating structured reports compared to existing methods. Our experiments show that Abn-BLIP outperforms state-of-the-art medical vision-language models and 3D report generation methods in both accuracy and clinical relevance. These results highlight the potential of integrating multimodal learning strategies for improving radiology reporting. The source code is available at https://github.com/zzs95/abn-blip.

Paper Structure

This paper contains 21 sections, 12 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Abn-BLIP inference pipeline for CTPA abnormality identification and structured report generation. The Abn-IDed image encoder detects 32 CTPA abnormalities and extracts abnormality-identified features. Through Abn-QFormer, the learned visual queries interrogate the CTPA scan to extract the corresponding abnormal findings. These queries help generate a structured CTPA report, categorizing abnormalities under relevant organ-specific sections, such as pulmonary arteries and the heart.
  • Figure 2: The figure illustrates the population distribution of 32 CTPA abnormalities across two datasets (BUH and INSPECT), categorized into 7 anatomical regions: Pulmonary Arteries, Lungs and Airways, Pleura, Heart, Mediastinum and Hila, Chest Wall and Lower Neck, and Bones. This hierarchical framework facilitates comprehensive abnormality detection and enhances the generation of clinically meaningful CTPA reports. The abnormality labels were extracted from radiology reports using a large language model (LLM), enabling a multi-dimensional assessment of inter-regional variations across the datasets.
  • Figure 3: Overview of the proposed Abn-BLIP model for CTPA abnormality diagnosis and report generation. (a) Anatomy-guided multi-abnormality identification in Stage 1: Multi-scale abnormality-identified image feature extraction for transformer encoders. (b) Abnormality-driven visual Querying Transformers (Abn-QFormer): Joint optimization of two objectives, enforcing abnormal queries (a set of learnable embeddings) to extract visual abnormal representations most relevant to their corresponding abnormal text descriptions. (c) Abnormality-aligned Contrastive Learning (ACL): Achieving more fine-grained visual queried representations by aligning abnormalities.
  • Figure 4: Visualization of the cross-modal cosine similarity heatmap between textual and visual features of 32 distinct CTPA abnormalities. The textual features are derived from the text descriptions of each abnormality, while the visual features are the queried representations on the corresponding images. Each cell in the heatmap indicates the similarity score between a specific abnormality's textual and visual representation, providing insight into the alignment between the two modalities.
  • Figure 5: t-SNE visualization of normalized image and text features for abnormalities. Points are drawn from 20,000 randomly sampled features, and each color represents one of the 32 detected abnormalities. (a) The abnormal image features were extracted using visual querying, guided by learned abnormality-wise queries from the visual querying transformer encoder. (b) The abnormal text features were encoded by a text transformer encoder based on descriptive sentences of the abnormalities.
  • ...and 3 more figures
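
The cross-modal similarity heatmap described in the Figure 4 caption can be illustrated with a small sketch. The code below is not taken from the Abn-BLIP repository. It assumes that each of the 32 abnormalities has one text feature vector and one visually queried feature vector, and it substitutes synthetic random features for real model outputs. The names `make_toy_features` and `FEATURE_DIM`, along with the noise level, are hypothetical choices made only for illustration. The sketch computes the 32 × 32 cosine similarity matrix in plain Python.

```python
import math
import random

NUM_ABNORMALITIES = 32  # number of abnormalities, as stated in the figure captions
FEATURE_DIM = 16        # hypothetical embedding size, chosen only for illustration


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity_matrix(text_feats, visual_feats):
    """Return S where S[i][j] is the cosine of text_feats[i] and visual_feats[j]."""
    text_unit = [l2_normalize(t) for t in text_feats]
    visual_unit = [l2_normalize(v) for v in visual_feats]
    return [
        [sum(a * b for a, b in zip(t, v)) for v in visual_unit]
        for t in text_unit
    ]


def make_toy_features(seed=0, noise=0.3):
    """Synthetic stand-ins (assumption): each visual feature is a noisy copy
    of the text feature for the same abnormality."""
    rng = random.Random(seed)
    text = [
        [rng.gauss(0, 1) for _ in range(FEATURE_DIM)]
        for _ in range(NUM_ABNORMALITIES)
    ]
    visual = [[x + rng.gauss(0, noise) for x in t] for t in text]
    return text, visual


text_feats, visual_feats = make_toy_features()
sim = cosine_similarity_matrix(text_feats, visual_feats)

# Mean similarity of matched (diagonal) and mismatched (off-diagonal) pairs.
n = NUM_ABNORMALITIES
diagonal_mean = sum(sim[i][i] for i in range(n)) / n
off_diagonal_mean = sum(
    sim[i][j] for i in range(n) for j in range(n) if i != j
) / (n * (n - 1))
```

In a well-aligned model, one would expect the diagonal of `sim` (matching text and visual features) to be larger on average than the off-diagonal entries, which is the pattern a heatmap like Figure 4 is meant to reveal.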