LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation
Steven Song, Anirudh Subramanyam, Irene Madejski, Robert L. Grossman
TL;DR
LaB-RAG introduces a label-boosted retrieval augmented generation framework for radiology report generation that avoids fine-tuning large models. It derives radiology-specific textual labels from zero-shot image embeddings using lightweight LaB-Classifiers, then uses these labels to filter and format retrieved exemplars before prompting a general-domain LLM via in-context learning. Across MIMIC-CXR and CheXpert Plus, LaB-RAG achieves state-of-the-art F1-CheXbert findings and competitive RadGraph performance, with ablations showing additive gains from label filtering and prompt formatting. The approach demonstrates that modular, low-cost components can meaningfully boost radiology report generation and can synergize with existing fine-tuning methods for further gains.
Abstract
In the current paradigm of image captioning, deep learning models are trained to generate text from image embeddings of latent features. We challenge the assumption that fine-tuning large, bespoke models is required to improve generation accuracy. Here we propose Label Boosted Retrieval Augmented Generation (LaB-RAG), a small-model-based approach to image captioning that uses image descriptors in the form of categorical labels to boost standard retrieval augmented generation (RAG) with pretrained large language models (LLMs). We study our method in the context of radiology report generation (RRG) over MIMIC-CXR and CheXpert Plus. We argue that simple classification models combined with zero-shot embeddings can effectively transform X-rays into text space as radiology-specific labels. In combination with standard RAG, we show that these derived text labels can be used with general-domain LLMs to generate radiology reports. Without ever training our generative language model or image embedding models specifically for the task, and without ever directly "showing" the LLM an X-ray, we demonstrate that LaB-RAG achieves better results across natural language and radiology language metrics than other retrieval-based RRG methods, while remaining competitive with fine-tuned vision-language RRG models. We further conduct extensive ablation experiments to better understand the components of LaB-RAG. Our results suggest that LaB-RAG is broadly compatible with fine-tuned methods and may combine with them to further improve RRG performance.
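As a rough illustration of the pipeline described above, the following sketch chains the three steps: per-label classifiers over a frozen image embedding, label-filtered retrieval of exemplar reports, and label-formatted prompt construction. It is not the authors' implementation. The label subset, the logistic form of the classifiers, the "share at least one label" filter rule, the prompt template, and every function and field name (`predict_labels`, `retrieve`, `build_prompt`, `embedding`, `labels`, `report`) are assumptions made for the example.

```python
import math

# Illustrative label subset (assumption); the paper's label set is not listed here.
LABELS = ["Cardiomegaly", "Edema", "Pleural Effusion"]


def predict_labels(embedding, weights, biases, threshold=0.5):
    """Run one independent logistic classifier per label on a frozen embedding."""
    positives = []
    for label in LABELS:
        logit = sum(w * x for w, x in zip(weights[label], embedding)) + biases[label]
        prob = 1.0 / (1.0 + math.exp(-logit))
        if prob >= threshold:
            positives.append(label)
    return positives


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_emb, query_labels, corpus, k=2):
    """Keep exemplars sharing at least one label with the query, then rank by similarity.

    Falls back to the unfiltered corpus when no exemplar matches (assumed rule).
    """
    wanted = set(query_labels)
    candidates = [ex for ex in corpus if wanted & set(ex["labels"])]
    if not candidates:
        candidates = corpus
    ranked = sorted(
        candidates,
        key=lambda ex: cosine(query_emb, ex["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query_labels, exemplars):
    """Format exemplars and the query's predicted labels into an in-context prompt."""
    lines = ["Write the findings section of a chest X-ray report.", ""]
    for i, ex in enumerate(exemplars, 1):
        lines.append(f"Example {i}")
        lines.append("Labels: " + (", ".join(ex["labels"]) or "No Finding"))
        lines.append("Findings: " + ex["report"])
        lines.append("")
    lines.append("Labels: " + (", ".join(query_labels) or "No Finding"))
    lines.append("Findings:")
    return "\n".join(lines)
```

The string returned by `build_prompt` would then be passed to a general-domain LLM. Note that the model only ever sees text (labels and exemplar reports), never the image itself, which matches the claim that the LLM is not directly shown an X-ray.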
