Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation

Difei Gu, Yunhe Gao, Mu Zhou, Dimitris Metaxas

TL;DR

Anatomy-VLM tackles the challenge of fine-grained, interpretable disease interpretation from radiographs by aligning region-level visual features with structured anatomical knowledge. The model introduces anatomy-aware queries within a vision transformer to detect 29 anatomical regions, enrich region representations with medical knowledge, and perform region- and image-level disease classification through multi-scale contrastive learning. Empirical results show strong zero-shot performance, robustness to distribution shifts, and improved segmentation accuracy on heart and pneumonia tasks, underscoring the benefit of anatomically grounded, multi-scale alignment over holistic image–text matching. The approach offers clinically interpretable predictions and generalizes across modalities, suggesting broad applicability to radiology and beyond.

Abstract

Accurate disease interpretation from radiology remains challenging due to imaging heterogeneity. Achieving expert-level diagnostic decisions requires integration of subtle image features with clinical knowledge. Yet major vision-language models (VLMs) treat images as holistic entities and overlook fine-grained image details that are vital for disease diagnosis. Clinicians analyze images by drawing on their prior medical knowledge and identifying anatomical structures as important regions of interest (ROIs). Inspired by this human-centric workflow, we introduce Anatomy-VLM, a fine-grained vision-language model that incorporates multi-scale information. First, we design a model encoder to localize key anatomical features from entire medical images. Second, these regions are enriched with structured knowledge for contextually aware interpretation. Finally, the model encoder aligns multi-scale medical information to generate clinically interpretable disease predictions. Anatomy-VLM achieves outstanding performance on both in- and out-of-distribution datasets. We also validate the performance of Anatomy-VLM on downstream image segmentation tasks, suggesting that its fine-grained alignment captures anatomical and pathology-related knowledge. Furthermore, Anatomy-VLM's encoder facilitates zero-shot anatomy-wise interpretation, demonstrating strong, expert-level clinical interpretation capabilities.

Paper Structure

This paper contains 16 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of methodological differences among the human-expert workflow, a standard VLM, and our Anatomy-VLM, which is inspired by a radiologist's workflow. (a) A human-expert radiologist first identifies anatomical structures, then performs a region-specific assessment of each area, identifies abnormalities, and finally synthesizes the findings into a coherent clinical report. (b) Conventional vision–language methods embed the entire chest X-ray as a single global feature and match it to textual concepts via cosine similarity, which limits spatial precision and interpretability. (c) Anatomy-VLM emphasizes anatomy-wise contrastive learning. We partition the radiograph into clinically meaningful regions, generate a dedicated embedding for each region, and contrast these embeddings against structured anatomical concepts. Our approach delivers fine-grained, interpretable predictions that align with the radiologist’s interpretation pipeline.
  • Figure 2: Overview of Anatomy-VLM. The model processes medical images and clinical text to generate global and region-specific alignments, using anatomy queries to detect and localize anatomical structures. We design the model to follow human expert diagnostic interpretation with three components: Anatomical Region Detection, Region-specific Alignment, and Global Alignment.
  • Figure 3: Evaluation of VLP encoders on segmentation tasks. The figure shows segmentation results for chest X-ray images under four different conditions: ChexMask Frozen, ChexMask Transfer, SIIM-ACR Frozen, and SIIM-ACR Transfer. For each condition, the original image and ground truth mask are shown alongside segmentation outputs from five different methods: CLIP, Biomed CLIP, BioViL, MedKLIP, and Anatomy-VLM (Ours). The segmentation masks are displayed as heatmaps overlaid on the original images, with warmer colors indicating higher-confidence regions. The results suggest that Anatomy-VLM's encoder representations capture anatomical knowledge and transfer well to segmentation.
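
Panels (b) and (c) of Figure 1 contrast holistic matching with anatomy-wise matching. The difference can be illustrated with a minimal sketch. It is not the authors' implementation: the 3-dimensional embeddings, the region and concept names, the temperature of 0.07, and the use of a mean over regions as a stand-in "global" embedding are all assumptions made for illustration. The helper names (`global_zero_shot`, `region_zero_shot`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=0.07):
    """Temperature-scaled softmax; 0.07 is an assumed value, not from the paper."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def global_zero_shot(image_emb, concept_embs):
    """Panel (b): one embedding for the whole image vs. every text concept."""
    names = list(concept_embs)
    probs = softmax([cosine(image_emb, concept_embs[n]) for n in names])
    return dict(zip(names, probs))


def region_zero_shot(region_embs, concept_embs):
    """Panel (c): each region embedding is scored against every text concept."""
    return {region: global_zero_shot(emb, concept_embs)
            for region, emb in region_embs.items()}


# Toy embeddings, made up for illustration only.
concepts = {"cardiomegaly": [1.0, 0.0, 0.0], "no finding": [0.0, 1.0, 0.0]}
regions = {
    "cardiac silhouette": [0.9, 0.1, 0.0],
    "left lung": [0.1, 0.9, 0.1],
}

# Assumed stand-in for a holistic image feature: the mean of region embeddings.
dim = 3
global_emb = [sum(e[i] for e in regions.values()) / len(regions)
              for i in range(dim)]

global_pred = global_zero_shot(global_emb, concepts)
region_pred = region_zero_shot(regions, concepts)

print("global:", global_pred)
for region, pred in region_pred.items():
    print(region, "->", max(pred, key=pred.get))
```

In this toy setup the averaged global embedding is equally similar to both concepts, so the holistic prediction is uninformative, while the per-region scores attribute "cardiomegaly" to the cardiac silhouette and "no finding" to the left lung. This is the kind of spatial attribution the caption describes, though real encoders and learned embeddings would behave less cleanly.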