Large Language Model with Region-guided Referring and Grounding for CT Report Generation

Zhixuan Chen, Yequan Bie, Haibo Jin, Hao Chen

TL;DR

The paper tackles CT report generation by moving beyond global volume features to region-aware analysis based on universal segmentation masks. It introduces Reg2RG, a region-guided referring and grounding framework that decouples local features into texture and geometry, fuses them with global context, and trains with a region-report alignment strategy before decoding with a large language model. Experiments on RadGenome-ChestCT and CTRG-Chest-584K show improvements in natural language generation and clinical efficacy metrics, along with better interpretability through explicit region grounding. In practice, the approach yields more accurate, region-specific CT reports that are grounded in anatomical regions. The authors acknowledge that performance depends on segmentation quality and plan lesion-level and multimodal extensions as future work.

Abstract

Computed tomography (CT) report generation is crucial for assisting radiologists in interpreting CT volumes, a task that can be time-consuming and labor-intensive. Existing methods mostly consider only the global features of the entire volume, so they struggle to focus on specific regions and may miss abnormalities. To address this issue, we propose Reg2RG, the first region-guided referring and grounding framework for CT report generation, which enhances diagnostic performance by focusing on anatomical regions within the volume. Specifically, we use masks from a universal segmentation module to capture local features for each referring region. A local feature decoupling (LFD) strategy is proposed to preserve local high-resolution details with little computational overhead. The local features are then integrated with global features to capture inter-regional relationships within a cohesive context. Moreover, we propose a novel region-report alignment (RRA) training strategy. It leverages the recognition of referring regions to guide the generation of region-specific reports, enhancing the model's referring and grounding capabilities while also improving the report's interpretability. A large language model (LLM) is further employed as the language decoder to generate reports from the integrated visual features, facilitating region-level comprehension. Extensive experiments on two large-scale chest CT-report datasets demonstrate the superiority of our method, which outperforms several state-of-the-art methods in terms of both natural language generation and clinical efficacy metrics while offering promising interpretability. The code is available at https://github.com/zhi-xuan-chen/Reg2RG.
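
The texture/geometry decoupling mentioned in the TL;DR can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It only assumes that "texture" means the masked volume cropped to the region's bounding box at native resolution, and that "geometry" means the full, uncropped binary mask. The function names, the nested-list volume layout, and the background fill value of 0.0 are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Volume = List[List[List[float]]]  # CT intensities indexed as [z][y][x]
Mask = List[List[List[int]]]      # binary region mask with the same shape
Range = Tuple[int, int]           # half-open (start, stop) index range


def bounding_box(mask: Mask) -> Tuple[Range, Range, Range]:
    """Return the half-open index range of the mask foreground along z, y, x."""
    coords = [
        (z, y, x)
        for z, plane in enumerate(mask)
        for y, row in enumerate(plane)
        for x, value in enumerate(row)
        if value
    ]
    if not coords:
        raise ValueError("mask has no foreground voxels")
    zs, ys, xs = zip(*coords)
    return (min(zs), max(zs) + 1), (min(ys), max(ys) + 1), (min(xs), max(xs) + 1)


def decouple_region(volume: Volume, mask: Mask) -> Tuple[Volume, Mask]:
    """Split one region into a cropped texture block and an uncropped geometry mask."""
    (z0, z1), (y0, y1), (x0, x1) = bounding_box(mask)
    # Texture: voxels inside the bounding box; voxels outside the mask are
    # replaced by 0.0 (an illustrative fill value, not taken from the paper).
    texture = [
        [
            [volume[z][y][x] if mask[z][y][x] else 0.0 for x in range(x0, x1)]
            for y in range(y0, y1)
        ]
        for z in range(z0, z1)
    ]
    # Geometry: a copy of the full mask, which keeps the region's position
    # and shape relative to the whole volume.
    geometry = [[list(row) for row in plane] for plane in mask]
    return texture, geometry
```

Cropping keeps the texture input small even when the region is encoded at full resolution, while the uncropped mask retains where the region sits in the volume. How each part is subsequently encoded and fused with global features is outside the scope of this sketch.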

Paper Structure

This paper contains 29 sections, 9 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Our method vs. the vanilla method. The gray background highlights instances of incorrect diagnosis. For ease of comparison, we divide the report into region-level sections. (a) The vanilla method based on global features is prone to neglecting some abnormalities since it fails to explore local details. (b) In contrast, our method can correctly detect all abnormalities with the region-guided local features.
  • Figure 2: Overview of the proposed Reg2RG framework. It integrates global and local features as visual embeddings for the LLM to generate reports. Global features are encoded from the entire volume, while local features are extracted using segmentation masks to capture lesion details in sub-regions. The local features are decoupled into texture and geometry, where texture is derived from cropped masked volumes and geometry is obtained from the uncropped masks. Shuffling local features across various regions enhances the alignment between visual regions and their corresponding reports. The LLM focuses on each region individually to produce accurate and detailed region-specific reports.
  • Figure 3: The report length distributions of the ground-truth reports, along with those generated by our proposed method and the SOTA MedVInT [zhang2023pmc]. The Kullback-Leibler (KL) divergence is utilized to quantify the differences between the distributions of our method and MedVInT relative to the ground-truth reports.
  • Figure 4: Case studies of our model and the SOTA MedVInT [zhang2023pmc]. The different colors represent distinct anatomical areas, as shown at the bottom of each example. The gray background highlights incorrect diagnoses.
  • Figure 5: Region-level reports generated by our model. Each regional report refers to a specific region and is grounded in the anatomical area depicted in the left figure. The different colors correspond to distinct anatomical regions.
  • ...and 1 more figure
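
The KL-divergence comparison described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the bin width, the number of bins, the additive smoothing constant `eps`, and the direction of the divergence (generated relative to ground truth) are all assumptions, since the caption does not specify them.

```python
import math
from typing import List, Sequence


def length_histogram(lengths: Sequence[int], bin_width: int, num_bins: int) -> List[int]:
    """Count report lengths per bin; lengths past the last bin fall into it."""
    counts = [0] * num_bins
    for n in lengths:
        counts[min(n // bin_width, num_bins - 1)] += 1
    return counts


def kl_divergence(p_counts: Sequence[float], q_counts: Sequence[float], eps: float = 1e-8) -> float:
    """Compute KL(P || Q) in nats from two histograms defined over the same bins.

    eps is added to every bin before normalising so that empty bins do not
    produce log(0) or a division by zero (a hypothetical smoothing choice).
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must share the same bins")
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    divergence = 0.0
    for p_raw, q_raw in zip(p_counts, q_counts):
        p = (p_raw + eps) / p_total
        q = (q_raw + eps) / q_total
        if p > 0.0:  # a zero-probability bin contributes nothing to the sum
            divergence += p * math.log(p / q)
    return divergence
```

With per-report word counts for a model and for the ground truth, `kl_divergence(length_histogram(generated, 10, 50), length_histogram(reference, 10, 50))` gives one number per model; a smaller value means the generated length distribution sits closer to the ground-truth one.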