ChexFract: From General to Specialized -- Enhancing Fracture Description Generation

Nikolay Nechaev, Evgeniia Przhezdzetskaia, Dmitry Umerenkov, Dmitry V. Dylov

TL;DR

The paper tackles the challenge of generating accurate fracture descriptions from chest X-ray images. It introduces ChexFract, a fracture-focused dataset built from radiology reports via sentence extraction and location-specific templating, and trains fracture-focused vision-language models that pair domain-specific encoders with Phi-3.5. The study demonstrates that end-to-end encoder adaptation and templated supervision yield meaningful gains over general-purpose radiology models, achieving ROC-AUC up to 0.715 and improved F1 for fracture detection. The authors publicly release their best-performing fracture-reporting models and discuss clinical implications, including a recall-precision tradeoff suitable for screening workflows with radiologist review.

Abstract

Generating accurate and clinically meaningful radiology reports from chest X-ray images remains a significant challenge in medical AI. While recent vision-language models achieve strong results in general radiology report generation, they often fail to adequately describe rare but clinically important pathologies like fractures. This work addresses this gap by developing specialized models for fracture pathology detection and description. We train fracture-specific vision-language models with encoders from MAIRA-2 and CheXagent, demonstrating significant improvements over general-purpose models in generating accurate fracture descriptions. Analysis of model outputs by fracture type, location, and age reveals distinct strengths and limitations of current vision-language model architectures. We publicly release our best-performing fracture-reporting model, facilitating future research in accurate reporting of rare pathologies.

Paper Structure

This paper contains 27 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: ROC curves illustrating performance comparison across different encoder configurations (MAIRA-2 and CheXagent), text types (original and templated), and training conditions (frozen/unfrozen encoders). Each curve demonstrates the tradeoff between sensitivity (recall) and specificity across varying decision thresholds. Each point on the graph corresponds to a single model.
  • Figure 2: Balanced accuracy for the "Side" classification task across different model architectures. The solid line shows the mean accuracy averaged across multiple runs for each checkpoint, while the shaded area represents the standard deviation.
  • Figure 3: Balanced accuracy for the "Stage" classification task across different model architectures. The solid line shows the mean accuracy averaged across multiple runs for each checkpoint, while the shaded area represents the standard deviation.
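
The mean line and shaded band described for Figures 2 and 3 can be illustrated with a minimal sketch, which is not the authors' evaluation code. It assumes balanced accuracy is the mean of per-class recall and that the band is a population standard deviation across runs; the captions do not say which standard deviation is used. The labels, predictions, and function names are hypothetical.

```python
from statistics import mean, pstdev


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of all samples whose true label is this class.
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return mean(recalls)


def summarize_runs(y_true, runs):
    """Return (mean, std) of balanced accuracy across several runs.

    Assumption: population standard deviation (pstdev); the source
    does not specify population versus sample.
    """
    scores = [balanced_accuracy(y_true, preds) for preds in runs]
    return mean(scores), pstdev(scores)


# Hypothetical "Side" labels for four fracture findings.
y_true = ["left", "left", "left", "right"]

# Hypothetical predictions from two runs of the same checkpoint.
runs = [
    ["left", "left", "left", "left"],    # always predicts the majority class
    ["left", "left", "right", "right"],  # misses one "left" case
]

mu, sigma = summarize_runs(y_true, runs)
print(f"balanced accuracy: {mu:.3f} +/- {sigma:.3f}")
```

The first hypothetical run shows why balanced accuracy suits imbalanced labels: always predicting the majority class scores 0.75 in plain accuracy but only 0.5 in balanced accuracy.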