WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning

Mert Sonmezer, Serge Vasylechko, Duygu Atasoy, Seyda Ertekin, Sila Kurugol

TL;DR

The paper addresses the difficulty of retrieving pediatric wrist radiographs with analogous fracture patterns, where global image representations miss subtle regional cues. It introduces WristMIR, a region-aware retrieval framework that combines global wrist embeddings with bone-specific crops using structured radiology reports mined by MedGemma, trained with a dual-encoder CLIP setup and a multi-positive loss. A two-stage coarse-to-fine retrieval pipeline first selects anatomically consistent candidates and then reranks them by region-specific embeddings, yielding substantial gains in retrieval accuracy and fracture-diagnosis metrics (e.g., image-to-text Recall@5 improving from 0.82% to 9.35%, AUROC 0.949, AUPRC 0.953). The approach demonstrates the value of anatomy-guided retrieval for clinical decision support in pediatrics and provides a scalable framework for radiology image-text learning. Its limitations include dependence on detector and report quality and evaluation at a single institution; cross-domain validation and classify-then-retrieve studies are suggested as future work.
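
The multi-positive loss mentioned above can be illustrated with a minimal sketch. It assumes one common formulation, in which each image may have several correct captions and the loss averages the negative log-softmax over all of them. The function name `multi_positive_loss`, the temperature value, and the toy similarity matrices are illustrative assumptions, not the paper's exact equations.

```python
import math

def multi_positive_loss(sim, positives, temperature=0.07):
    """Average, over images, of the mean negative log-softmax of their positives.

    sim[i][j]    : similarity between image i and caption j (assumed cosine).
    positives[i] : set of caption indices treated as correct for image i.
    """
    total = 0.0
    for row, pos in zip(sim, positives):
        logits = [s / temperature for s in row]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Mean negative log-probability over all positives of this image.
        total += -sum(logits[j] - log_z for j in pos) / len(pos)
    return total / len(sim)

# Toy example: image 0 has two valid captions (0 and 1), image 1 has one (2).
positives = [{0, 1}, {2}]
aligned = [[0.9, 0.9, 0.1], [0.1, 0.1, 0.9]]
misaligned = [[0.1, 0.1, 0.9], [0.9, 0.9, 0.1]]
loss_aligned = multi_positive_loss(aligned, positives)
loss_misaligned = multi_positive_loss(misaligned, positives)
```

In this formulation an image with several positives cannot reach zero loss, because the softmax mass has to be shared among its correct captions; what matters is that aligned similarities score lower than misaligned ones.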

Abstract

Retrieving wrist radiographs with analogous fracture patterns is challenging because clinically important cues are subtle, highly localized and often obscured by overlapping anatomy or variable imaging views. Progress is further limited by the scarcity of large, well-annotated datasets for case-based medical image retrieval. We introduce WristMIR, a region-aware pediatric wrist radiograph retrieval framework that leverages dense radiology reports and bone-specific localization to learn fine-grained, clinically meaningful image representations without any manual image-level annotations. Using MedGemma-based structured report mining to generate both global and region-level captions, together with pre-processed wrist images and bone-specific crops of the distal radius, distal ulna, and ulnar styloid, WristMIR jointly trains global and local contrastive encoders and performs a two-stage retrieval process: (1) coarse global matching to identify candidate exams, followed by (2) region-conditioned reranking aligned to a predefined anatomical bone region. WristMIR improves retrieval performance over strong vision-language baselines, raising image-to-text Recall@5 from 0.82% to 9.35%. Its embeddings also yield stronger fracture classification (AUROC 0.949, AUPRC 0.953). In region-aware evaluation, the two-stage design markedly improves retrieval-based fracture diagnosis, increasing mean $F_1$ from 0.568 to 0.753, and radiologists rate its retrieved cases as more clinically relevant, with mean scores rising from 3.36 to 4.35. These findings highlight the potential of anatomically guided retrieval to enhance diagnostic reasoning and support clinical decision-making in pediatric musculoskeletal imaging. The source code is publicly available at https://github.com/quin-med-harvard-edu/WristMIR.
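
The two-stage retrieval process described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the released implementation: the function names, the dictionary layout of the database, the region key `distal_radius`, and the two-dimensional embeddings are all hypothetical, and cosine similarity is assumed as the matching score.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def two_stage_retrieve(query, database, region, k=3, n=2):
    """Return the ids of the n best exams after coarse-to-fine ranking."""
    # Stage 1 (coarse): keep the k exams closest to the query's global embedding.
    shortlist = sorted(
        database,
        key=lambda exam: cosine(query["global"], exam["global"]),
        reverse=True,
    )[:k]
    # Stage 2 (fine): rerank only the shortlist by the chosen bone-region embedding.
    reranked = sorted(
        shortlist,
        key=lambda exam: cosine(query["regions"][region], exam["regions"][region]),
        reverse=True,
    )
    return [exam["id"] for exam in reranked[:n]]

# Hypothetical toy embeddings (real embeddings would come from the trained encoders).
query = {"global": [1.0, 0.0], "regions": {"distal_radius": [0.0, 1.0]}}
database = [
    {"id": "A", "global": [1.0, 0.1], "regions": {"distal_radius": [1.0, 0.0]}},
    {"id": "B", "global": [0.9, 0.3], "regions": {"distal_radius": [0.1, 1.0]}},
    {"id": "C", "global": [0.8, 0.6], "regions": {"distal_radius": [0.5, 0.5]}},
    {"id": "D", "global": [0.0, 1.0], "regions": {"distal_radius": [0.0, 1.0]}},
]
top = two_stage_retrieve(query, database, "distal_radius", k=3, n=2)
```

Here exam A is the best global match but drops out after reranking because its distal-radius embedding disagrees with the query, while exam D never reaches the second stage despite a perfect region match. The second point shows the trade-off in the choice of `k`: the reranker can only promote candidates that the coarse stage lets through.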

Paper Structure (25 sections, 4 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Data preprocessing pipeline. (a) A YOLOv11 detector first identifies the wrist region of interest (ROI); CLAHE enhancement is then applied, and three anatomical regions (distal radius, distal ulna, and ulnar styloid) are localized and cropped. (b) MedGemma-27B converts each radiology report into a structured representation capturing anatomy-specific findings, which are then used to generate global exam-level captions and region-specific captions aligned with each bone crop.
  • Figure 2: WristMIR architecture. A query wrist radiograph is encoded to generate both global and bone-level embeddings. A YOLOv11 detector identifies the relevant bone regions (e.g., distal radius, distal ulna, ulnar styloid). The global embedding is used to retrieve the top-$k$ most similar exams from a precomputed database, after which these candidates are reranked using the region-specific embeddings to enable fine-grained, anatomy-aware retrieval.
  • Figure 3: WristMIR attention maps. The model consistently attends to fracture-relevant regions, focusing on localized morphological cues. Bounding boxes are shown only to guide visual interpretation of the fracture location; they were neither included in the dataset nor used during CLIP training.
  • Figure 4: Comparison of single- and two-stage retrieval. Region-conditioned reranking retrieves cases that are anatomically and fracture-pattern aligned, whereas single-stage retrieval often surfaces globally similar but pathologically mismatched images. Numbers indicate scores assigned by a pediatric radiologist; the proposed two-stage method receives higher scores, reflecting more clinically relevant retrieval.
  • Figure 5: WristMIR bone-level attention maps. For each anatomical region (distal radius, distal ulna, and ulnar styloid), the model concentrates its attention on localized morphological cues that align with fracture-relevant structures. The dashed bounding boxes are included only to guide the reader by indicating the approximate fracture locations.
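
For readers unfamiliar with the Recall@5 figures quoted in the abstract, the sketch below shows the usual definition of Recall@K: the fraction of queries for which at least one relevant item appears among the top K retrieved results. Whether the paper counts multiple positives in exactly this way is an assumption, and the function name and toy rankings are hypothetical.

```python
def recall_at_k(rankings, relevant, k):
    """Fraction of queries with at least one relevant item in their top-k results.

    rankings[i] : retrieved item ids for query i, best match first.
    relevant[i] : set of item ids considered correct for query i.
    """
    hits = 0
    for ranking, targets in zip(rankings, relevant):
        if any(item in targets for item in ranking[:k]):
            hits += 1
    return hits / len(rankings)

# Toy example with three queries and made-up caption ids.
rankings = [["t3", "t1", "t7"], ["t2", "t9", "t4"], ["t5", "t6", "t8"]]
relevant = [{"t1"}, {"t4"}, {"t0"}]
r_at_1 = recall_at_k(rankings, relevant, 1)
r_at_3 = recall_at_k(rankings, relevant, 3)
```

In the toy data no query has a relevant item ranked first, two of the three have one within the top three, and the third query has no relevant item in its list at all.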