When are radiology reports useful for training medical image classifiers?
Herman Bergström, Zhongqi Yue, Fredrik D. Johansson
TL;DR
This work investigates when radiology reports can improve the training of image-only medical classifiers across diagnostic and prognostic tasks, using benchmarks built from MIMIC-CXR and INSPECT. It systematically compares pre-training strategies (self-supervised learning, image-text alignment, and masked image-text modelling) with fine-tuning that uses reports as privileged information via distillation. The distillation approach has two stages: a teacher with access to $(V(\mathbf{X}), T(\mathbf{Z}))$ is trained first, and it then guides an image-only student through a blended loss: $(1-\lambda)L(f(V(\mathbf{X})), Y) + \lambda L(f(V(\mathbf{X})), g^{\tau}(V(\mathbf{X}), T(\mathbf{Z})))$, with $\lambda \in [0,1]$ and $\tau>0$. The study finds that pre-training with report supervision benefits diagnostic tasks when labels align with the text, but explicit image-text alignment can hurt generalization for labels weakly represented in the reports; combining text supervision with self-supervision mitigates this issue. It also shows that distillation during fine-tuning can yield substantial accuracy gains, sometimes exceeding those from pre-training, though the benefits are task- and backbone-dependent and can be negated when the text is overly predictive. Overall, the results provide actionable guidance on when and how to leverage privileged radiology text to train medical image classifiers, while highlighting limitations and directions for future work.
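To make the blended loss concrete, the following is a minimal sketch for a single example, not the authors' implementation. It assumes that $L$ is cross-entropy, that $g^{\tau}$ is a softmax over the teacher's logits with temperature $\tau$, and that the student is scored without temperature scaling; the summary above does not confirm any of these choices. The names `softmax`, `cross_entropy`, and `blended_loss` are hypothetical.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_probs):
    """Cross-entropy between a target distribution and softmax(student_logits)."""
    m = max(student_logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in student_logits))
    return -sum(p * (x - log_norm) for x, p in zip(student_logits, target_probs))


def blended_loss(student_logits, teacher_logits, label, lam=0.5, tau=2.0):
    """(1 - lam) * L(student, label) + lam * L(student, teacher soft targets).

    Assumption: L is cross-entropy and the teacher output g^tau is a
    temperature-scaled softmax of the teacher's logits.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if tau <= 0:
        raise ValueError("tau must be positive")
    # One-hot encoding of the ground-truth label Y.
    hard = [1.0 if k == label else 0.0 for k in range(len(student_logits))]
    # Soft targets from the teacher, which saw both the image and the report.
    soft = softmax(teacher_logits, tau)
    supervised = cross_entropy(student_logits, hard)
    distilled = cross_entropy(student_logits, soft)
    return (1.0 - lam) * supervised + lam * distilled


# Example: a 3-class student guided by a more confident teacher.
loss = blended_loss([1.0, 0.2, -0.5], [4.0, 0.0, -2.0], label=0, lam=0.5, tau=2.0)
```

Under these assumptions, setting `lam=0` recovers ordinary supervised fine-tuning and `lam=1` trains the student purely on the teacher's soft targets; a larger `tau` flattens those targets.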
Abstract
Medical images used to train machine learning models are often accompanied by radiology reports containing rich expert annotations. However, relying on these reports as inputs for clinical prediction requires the timely manual work of a trained radiologist. This raises a natural question: when can radiology reports be leveraged during training to improve image-only classification? Prior work is limited to evaluating pre-trained image representations by fine-tuning them to predict diagnostic labels, often extracted from reports, and ignores tasks whose labels are only weakly associated with the text. To address this gap, we conduct a systematic study of how radiology reports can be used during both pre-training and fine-tuning, across diagnostic and prognostic tasks (e.g., 12-month readmission), and under varying training set sizes. Our findings reveal that: (1) leveraging reports during pre-training is beneficial for downstream classification tasks where the label is well-represented in the text, whereas pre-training through explicit image-text alignment can be detrimental where the label is not; (2) fine-tuning with reports can lead to significant improvements and, in certain settings, can have a larger impact than the choice of pre-training method. These results provide actionable insights into when and how to leverage privileged text data to train medical image classifiers while highlighting gaps in current research.
