Beyond the Embedding Bottleneck: Adaptive Retrieval-Augmented 3D CT Report Generation

Renjie Liang; Yiling Ma; Yang Xing; Zhengkang Fan; Jinqian Pan; Chengkun Sun; Li Li; Kuang Gong; Jie Xu

Abstract

Automated radiology report generation from 3D CT volumes often suffers from incomplete pathology coverage. We provide empirical evidence that this limitation stems from a representational bottleneck: contrastive 3D CT embeddings encode discriminative pathology signals, yet exhibit severe dimensional concentration, with as few as 2 effective dimensions out of 512. Corroborating this, scaling the language model yields no measurable improvement, suggesting that the bottleneck lies in the visual representation rather than the generator. This bottleneck limits both generation and retrieval; naive static retrieval fails to improve clinical efficacy and can even degrade performance. We propose AdaRAG-CT, an adaptive augmentation framework that compensates for this visual bottleneck by introducing supplementary textual information through controlled retrieval and selectively integrating it during generation. On the CT-RATE benchmark, AdaRAG-CT achieves state-of-the-art clinical efficacy, improving Clinical F1 from 0.420 (CT-Agent) to 0.480 (+6 points); ablation studies confirm that both the retrieval and generation components contribute to the improvement. Code is available at https://github.com/renjie-liang/Adaptive-RAG-for-3DCT-Report-Generation.
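
The "effective dimensions" figure in the abstract can be made concrete with a spectrum-based measure. The following is a minimal sketch, assuming the participation ratio of the embedding covariance eigenvalues as that measure; the paper may define effective dimensionality differently, and the function name and the synthetic data are hypothetical.

```python
import random


def participation_ratio(embeddings):
    """Return (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the covariance.

    Assumption: this participation ratio stands in for "effective dimensions".
    It uses the identities sum(lambda) = trace(C) and sum(lambda^2) = ||C||_F^2,
    so no eigendecomposition is required.
    """
    n = len(embeddings)
    d = len(embeddings[0])
    means = [sum(row[j] for row in embeddings) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in embeddings]
    # Sample covariance matrix C (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))
    frobenius_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return trace ** 2 / frobenius_sq


def synthetic_embeddings(n=500, d=8, strong_dims=2, seed=0):
    """Hypothetical data: a few high-variance dimensions, the rest nearly constant."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 1.0 if j < strong_dims else 0.01) for j in range(d)]
        for _ in range(n)
    ]


if __name__ == "__main__":
    value = participation_ratio(synthetic_embeddings())
    print(f"participation ratio: {value:.3f} of 8 nominal dimensions")
```

On the synthetic data, where only two of eight dimensions carry meaningful variance, the ratio comes out close to 2 regardless of the nominal width, which is the kind of concentration the abstract describes for 512-dimensional embeddings.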

Paper Structure (38 sections, 3 equations, 3 figures, 10 tables)

Introduction
Related Work
3D Medical Report Generation
Retrieval-Augmented Generation in Medical Imaging
Representation Quality in Contrastive Vision-Language Pre-training
Diagnosing the Representational Bottleneck in 3D CT Embeddings
Encoded but Narrow
Retrieval Under the Bottleneck
Method: AdaRAG-CT
Base Model
Organ-Indexed Sentence Database
Adaptive Retrieval Training
Adaptive Retrieval Inference
Two-Stage retrieval.
Text2Text retrieval.
...and 23 more sections

Figures (3)

Figure 1: Overview of AdaRAG-CT. (a) A 3D CT volume is encoded into a global embedding (CT-CLIP) and four organ-specific embeddings (ViSD-Boost), which are projected into the LLM input space as visual tokens. (b) An organ-indexed sentence database is built from training-set reports and indexed with FAISS for efficient retrieval. (c) During report generation, the LLM autonomously emits a [RAG] token when it needs external evidence; retrieved sentences are then injected into the context before the model continues generating.
Figure 2: Six evaluation metrics across training steps for Two-Stage and Text2Text retrieval pipelines on the 8B model, with bootstrap 95% confidence intervals.
Figure 3: Qualitative comparison for an aorta finding. Left: base model output; centre: AdaRAG-CT output with the regenerated sentence highlighted; right: ground-truth report. Red highlights indicate hallucinated content; green indicates correct clinical findings.
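
The control flow described in panel (c) of Figure 1 can be sketched as a small decoding loop. This is an illustrative toy under stated assumptions, not the authors' implementation: the scripted model, the `<evidence>` wrapper, the `[EOS]` marker, and the word-overlap scoring (a stand-in for a FAISS similarity search over sentence embeddings) are all hypothetical.

```python
from typing import Callable, Dict, List

RAG_TOKEN = "[RAG]"  # retrieval trigger named in the Figure 1 caption
EOS_TOKEN = "[EOS]"  # hypothetical end-of-report marker


def retrieve(database: Dict[str, List[str]], organ: str, query: str, k: int = 2) -> List[str]:
    """Rank one organ's sentences by word overlap with the query.

    Assumption: word overlap replaces the embedding search a FAISS index would do.
    """
    query_words = set(query.lower().split())
    candidates = database.get(organ, [])
    ranked = sorted(
        candidates,
        key=lambda s: len(query_words & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def generate_with_adaptive_retrieval(
    next_token: Callable[[List[str]], str],
    database: Dict[str, List[str]],
    organ: str,
    max_steps: int = 50,
) -> List[str]:
    """Decode token by token; on [RAG], inject retrieved sentences, then continue."""
    context: List[str] = []
    for _ in range(max_steps):
        token = next_token(context)
        if token == EOS_TOKEN:
            break
        if token == RAG_TOKEN:
            # Query with the text generated so far, excluding earlier evidence.
            query = " ".join(t for t in context if not t.startswith("<evidence>"))
            for sentence in retrieve(database, organ, query):
                context.append(f"<evidence>{sentence}</evidence>")
            continue  # the trigger token itself is not kept in the context
        context.append(token)
    return context


def scripted_model(context: List[str]) -> str:
    """Hypothetical stand-in for the LLM: requests evidence once, then finishes."""
    if not context:
        return "aorta"
    if not any(t.startswith("<evidence>") for t in context):
        return RAG_TOKEN
    if context[-1].startswith("<evidence>"):
        return "calcification"
    return EOS_TOKEN


TOY_DATABASE = {
    "aorta": [
        "Aorta shows atherosclerotic calcification.",
        "Aorta caliber is normal.",
        "No pleural effusion.",
    ],
}

if __name__ == "__main__":
    for item in generate_with_adaptive_retrieval(scripted_model, TOY_DATABASE, "aorta"):
        print(item)
```

The point of the sketch is that retrieval is triggered by the generator itself rather than applied to every sentence, so evidence enters the context only at the steps where the model asks for it.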
