CXR-Agent: Vision-language models for chest X-ray interpretation with uncertainty aware radiology reporting
Naman Sharma
TL;DR
The work addresses the challenge of reliable chest X-ray interpretation using vision-language models and the risks of confident hallucinations in radiology reports. It systematically evaluates state-of-the-art VLMs (e.g., CheXagent, BioViL-T, BLIP-2) and introduces an uncertainty-aware, agent-based reporting workflow that leverages a probed vision encoder and phrase grounding to generate findings with calibrated confidence. Key contributions include detailed linear probing of CheXagent components, a modular CXR-agent architecture that combines pathology detection, localisation, and LLM-driven report generation, and a data-collection platform for expert clinical evaluation. Findings show that vision encoders from foundational models generalise well across datasets and that LLM bottlenecks and data diversity limit end-to-end performance; the proposed uncertainty-aware approach improves interpretability and safety, though localisation and data scarcity remain critical challenges. The work underscores the need for larger, diverse paired scan-report datasets and careful normal-vs-abnormal evaluation to enable safe, clinically useful AI radiology tools.
Abstract
Recently, large vision-language models have shown potential for interpreting complex images and generating natural language descriptions using advanced reasoning. Medicine is inherently multimodal: clinicians combine scans with text-based medical histories to write reports, so it stands to benefit from these leaps in AI capabilities. We evaluate the publicly available, state-of-the-art foundational vision-language models for chest X-ray interpretation across several datasets and benchmarks. We use linear probes to evaluate the performance of various components, including CheXagent's vision transformer and Q-former, which outperform the industry-standard Torch X-ray Vision models across many different datasets, showing robust generalisation capabilities. Importantly, we find that vision-language models often hallucinate with confident language, which slows down clinical interpretation. Based on these findings, we develop an agent-based vision-language approach for report generation that uses CheXagent's linear probes and BioViL-T's phrase grounding tools to generate uncertainty-aware radiology reports, with pathologies localised and described according to their likelihood. We thoroughly evaluate our vision-language agents using NLP metrics, chest X-ray benchmarks and clinical evaluations, developing an evaluation platform to run a user study with respiratory specialists. Our results show considerable improvements in the accuracy, interpretability and safety of the AI-generated reports. We stress the importance of analysing results for normal and abnormal scans separately. Finally, we emphasise the need for larger paired (scan and report) datasets, alongside data augmentation, to tackle the overfitting seen in these large vision-language models.
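As an illustration of what describing pathologies "based on their likelihood" could look like, the following is a minimal sketch that maps per-pathology probabilities (such as those a linear probe might output) to hedged report sentences. The probability bands, the phrases, the `describe_findings` function and the example values are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical probability bands and hedging phrases (illustrative only,
# not the thresholds or wording used in the paper). Ordered high to low.
BANDS = [
    (0.85, "There is"),
    (0.60, "There is likely"),
    (0.40, "There is possibly"),
]


def describe_findings(probs: Dict[str, float], floor: float = 0.40) -> List[str]:
    """Turn per-pathology probabilities into hedged sentences, most likely first.

    Pathologies with probability below `floor` are omitted from the output.
    """
    sentences = []
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    for name, p in ranked:
        if p < floor:
            continue
        # Pick the first (highest) band whose threshold the probability meets.
        for threshold, phrase in BANDS:
            if p >= threshold:
                sentences.append(f"{phrase} {name} (p={p:.2f}).")
                break
    return sentences


if __name__ == "__main__":
    # Made-up probabilities standing in for linear-probe outputs.
    example = {"pleural effusion": 0.91, "cardiomegaly": 0.55, "pneumothorax": 0.08}
    for line in describe_findings(example):
        print(line)
```

Keeping the probability next to the hedged phrase lets a reader see both the wording and the number behind it, which is one simple way to avoid the confident-sounding hallucinations described above.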
