Large Language Model with Region-guided Referring and Grounding for CT Report Generation
Zhixuan Chen, Yequan Bie, Haibo Jin, Hao Chen
TL;DR
The paper tackles CT report generation by moving beyond global volume features to region-aware analysis using universal segmentation masks. It introduces Reg2RG, a region-guided referring and grounding framework that decouples local features into texture and geometry, and fuses them with global context through a region-report alignment strategy and a large-language-model decoder. Empirical results on RadGenome-ChestCT and CTRG-Chest-584K show improvements in natural language generation and clinical efficacy metrics, along with enhanced interpretability via explicit region grounding. In practice, the approach delivers more accurate, region-specific CT reports with reliable regional grounding. The authors acknowledge limitations related to segmentation quality and plan lesion-level and multimodal extensions as future work.
Abstract
Computed tomography (CT) report generation is crucial for assisting radiologists in interpreting CT volumes, a task that can be time-consuming and labor-intensive. Existing methods primarily consider only the global features of the entire volume, so they struggle to focus on specific regions and may miss abnormalities. To address this issue, we propose Reg2RG, the first region-guided referring and grounding framework for CT report generation, which enhances diagnostic performance by focusing on anatomical regions within the volume. Specifically, we utilize masks from a universal segmentation module to capture local features for each referring region. A local feature decoupling (LFD) strategy is proposed to preserve local high-resolution details with little computational overhead. The local features are then integrated with global features to capture inter-regional relationships within a cohesive context. Moreover, we propose a novel region-report alignment (RRA) training strategy. It leverages the recognition of referring regions to guide the generation of region-specific reports, enhancing the model's referring and grounding capabilities while also improving the report's interpretability. A large language model (LLM) is further employed as the language decoder to generate reports from the integrated visual features, facilitating region-level comprehension. Extensive experiments on two large-scale chest CT-report datasets demonstrate the superiority of our method, which outperforms several state-of-the-art methods in terms of both natural language generation and clinical efficacy metrics while preserving promising interpretability. The code is available at https://github.com/zhi-xuan-chen/Reg2RG.
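
As a rough illustration of the idea behind mask-guided local features and their decoupling into texture and geometry, the following NumPy sketch splits one region into a full-resolution crop and a coarse whole-volume mask. It is a minimal sketch under stated assumptions, not the Reg2RG implementation: the function name `decouple_region`, the `geometry_stride` parameter, and the stride-based subsampling are all hypothetical choices made for illustration.

```python
import numpy as np


def decouple_region(volume, mask, geometry_stride=4):
    """Split one region into a texture crop and a coarse geometry mask.

    Hypothetical helper: the name, the stride parameter, and the
    subsampling scheme are illustrative assumptions, not taken from
    the paper or its code.
    """
    if volume.shape != mask.shape:
        raise ValueError("volume and mask must have the same shape")
    mask = mask.astype(bool)
    if not mask.any():
        raise ValueError("mask is empty")

    # Tight bounding box of the region along each axis.
    coords = np.argwhere(mask)
    lo = coords.min(axis=0)
    hi = coords.max(axis=0) + 1
    box = tuple(slice(a, b) for a, b in zip(lo, hi))

    # Texture: full-resolution voxels inside the box, background zeroed.
    texture = np.where(mask[box], volume[box], 0.0)

    # Geometry: the whole-volume mask subsampled by a fixed stride, which
    # keeps the region's position and rough shape at low cost.
    s = geometry_stride
    geometry = mask[::s, ::s, ::s].astype(np.float32)
    return texture, geometry


# Synthetic example: a random volume with one box-shaped region.
volume = np.random.default_rng(0).normal(size=(32, 64, 64)).astype(np.float32)
mask = np.zeros(volume.shape, dtype=bool)
mask[8:16, 20:40, 24:48] = True

texture, geometry = decouple_region(volume, mask)
print(texture.shape, geometry.shape)  # (8, 20, 24) (8, 16, 16)
```

The crop keeps fine detail only where the region lies, while the subsampled mask records where the region sits in the volume. Plain striding can miss very small regions, so a real pipeline would likely use a pooling or interpolation step instead.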
