More Than Positive and Negative: Communicating Fine Granularity in Medical Diagnosis
Xiangyu Peng, Kai Wang, Jianfei Yang, Yingying Zhu, Yang You
TL;DR
This work addresses the mismatch between binary chest X-ray diagnosis and the clinical variability found within positive findings. It introduces a fine-grained benchmark that splits positive cases into atypical and typical positives based on severity and change over time, and measures how well a model separates the two with AUC$^\text{FG}$. To learn this granularity from coarse labels alone, the authors propose PU-RM, a risk-modulated training framework that uses the Partially Huberised Cross Entropy loss with a tangent point $\tau$. On the MIMIC-CXR-JPG dataset, PU-RM yields higher AUC$^\text{FG}$ on consolidation and edema than baseline uncertainty methods, supported by CAM visualizations showing more appropriate activation patterns. Together, these results offer a practical baseline and a step toward AI diagnoses that communicate clinically meaningful fine-grained knowledge.
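For readers unfamiliar with the loss named above, the following is a minimal sketch of the partially Huberised cross entropy as it is commonly defined in the label-noise literature: the usual $-\log p$ is replaced by its tangent line wherever the probability $p$ assigned to the labelled class falls below $1/\tau$, which bounds the loss and its gradient on hard or mislabelled cases. The function name `phuber_ce`, the default `tau=10.0`, and the scalar interface are illustrative assumptions; how PU-RM applies the loss within its risk modulation is not described in this summary.

```python
import math


def phuber_ce(p: float, tau: float = 10.0) -> float:
    """Partially Huberised cross entropy for a single example (illustrative sketch).

    p   : probability the model assigns to the labelled class, in [0, 1].
    tau : tangent-point parameter (> 1); the default is an arbitrary assumption.
    """
    if tau <= 1.0:
        raise ValueError("tau must be greater than 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")

    if p <= 1.0 / tau:
        # Tangent line to -log(p) at p = 1/tau: bounded value, slope fixed at -tau.
        return -tau * p + math.log(tau) + 1.0
    # Ordinary cross entropy above the tangent point.
    return -math.log(p)


if __name__ == "__main__":
    for prob in (0.01, 0.1, 0.5, 0.99):
        print(f"p={prob:.2f}  loss={phuber_ce(prob):.4f}")
```

The two branches meet at $p = 1/\tau$ with equal value and slope, so the loss stays continuous and differentiable while never exceeding $\log \tau + 1$.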
Abstract
With the advance of deep learning, much progress has been made in building powerful artificial intelligence (AI) systems for automatic Chest X-ray (CXR) analysis. Most existing AI models are trained as binary classifiers that distinguish positive from negative cases. However, a large gap exists between this simple binary setting and complicated real-world medical scenarios. In this work, we reinvestigate the problem of automatic radiology diagnosis. We first observe that there is considerable diversity among cases within the positive class, which means that simply classifying them as positive loses many important details. This motivates us to build AI models that can communicate fine-grained knowledge from medical images as human experts do. To this end, we first propose a new benchmark on fine granularity learning from medical images. Specifically, we devise a division rule based on medical knowledge to divide positive cases into two subcategories, namely atypical positive and typical positive. Then, we propose a new metric, termed AUC$^\text{FG}$, computed on the two subcategories to evaluate a model's ability to separate them. With the proposed benchmark, we encourage the community to develop AI diagnosis systems that better learn fine granularity from medical images. Finally, we propose a simple risk modulation approach to this problem that uses only coarse labels in training. Empirical results show that, despite its simplicity, the proposed method achieves superior performance and thus serves as a strong baseline.
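As a rough illustration of the AUC$^\text{FG}$ idea described above, the sketch below computes a pairwise AUC restricted to the two positive subcategories. It assumes, without confirmation from this abstract, that a model with good fine granularity assigns higher scores to typical positives than to atypical positives; the function name `auc_fg` and the example scores are hypothetical, and the paper's exact definition may differ.

```python
from typing import Sequence


def auc_fg(scores_atypical: Sequence[float], scores_typical: Sequence[float]) -> float:
    """Pairwise AUC between atypical and typical positives (illustrative sketch).

    Returns the fraction of (typical, atypical) pairs in which the typical
    positive receives the higher score; ties count as half a win.
    Negative cases are excluded entirely.
    """
    if not scores_atypical or not scores_typical:
        raise ValueError("both subcategories need at least one score")

    wins = 0.0
    for t in scores_typical:
        for a in scores_atypical:
            if t > a:
                wins += 1.0
            elif t == a:
                wins += 0.5
    return wins / (len(scores_typical) * len(scores_atypical))


if __name__ == "__main__":
    # Hypothetical predicted probabilities for one finding, e.g. edema.
    atypical = [0.55, 0.70, 0.62]
    typical = [0.90, 0.65, 0.81]
    print(f"AUC_FG = {auc_fg(atypical, typical):.3f}")
```

A value near 1 means the two subcategories are well separated by the model's scores, while 0.5 means the scores carry no information about which positives are typical, even if the model separates positives from negatives perfectly.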
