DiffVP: Differential Visual Semantic Prompting for LLM-Based CT Report Generation

Yuhe Tian, Kun Zhang, Haoran Ma, Rui Yan, Yingtai Li, Rongsheng Wang, Shaohua Kevin Zhou

Abstract

While large language models (LLMs) have advanced CT report generation, existing methods typically encode 3D volumes holistically and fail to distinguish informative cues from redundant anatomical background. Inspired by radiological cognitive subtraction, we propose Differential Visual Prompting (DiffVP), which conditions report generation on explicit, high-level semantic scan-to-reference differences rather than solely on absolute visual features. DiffVP employs a hierarchical difference extractor that captures complementary global and local semantic discrepancies in a shared latent space, along with a difference-to-prompt generator that transforms these signals into learnable visual prefix tokens for LLM conditioning. These difference prompts serve as structured conditioning signals that implicitly suppress invariant anatomy while amplifying diagnostically relevant visual evidence, thereby facilitating accurate report generation without explicit lesion localization. On two large-scale benchmarks, DiffVP consistently outperforms prior methods, improving the average of BLEU-1 through BLEU-4 by +10.98 and +4.36, respectively, and further boosts clinical efficacy on RadGenome-ChestCT (F1 score 0.421). Code will be released at https://github.com/ArielTYH/DiffVP/.

Paper Structure (45 sections, 11 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Comparison of conventional methods and our proposed framework. (a) Previous methods feed whole-volume visual tokens uniformly into an LLM for report generation. (b) In clinical practice, radiologists often compare the current scan image with a normal reference image to identify differences. (c) Inspired by this, our method leverages a normal reference prior to derive deviation-aware visual prompts, guiding the LLM to generate more accurate and fine-grained reports.
  • Figure 2: Overall architecture. The framework takes a target CT volume and a normal reference CT as inputs. A shared visual encoder and resampler produce aligned latent tokens ($I$ and $I^{r}$). A difference-aware module then derives (a) a global delta $\Delta_{\text{global}}$ by applying a Transformer with a learnable query over $I$ and $I^{r}$, and (b) a local delta $\Delta_{\text{local}}$ by aggregating token-wise residuals with distance-based importance weights. These two signals are fused by (c) a difference-to-prompt generator to form a visual difference prompt, which is prepended as a soft prefix to an LLM (LoRA-tuned) to guide medical report generation.
  • Figure 3: Visual token importance. (a) Normalized importance distributions for $N{=}32$ latent tokens with (blue) and without (orange) the proposed $\Delta_{\text{prompt}}$. (b) Performance gain by $\Delta_{\text{prompt}}$.
  • Figure 4: LLM-as-a-Judge evaluation. We employ GPT-5 for pairwise comparisons across five clinical axes: Granularity, Localization, Quantification, Sensitivity, and Utility.
  • Figure 5: Visual semantic discrepancy and discrepancy-level comparison. (a) Two examples showing the input CT, the Visual Semantic Difference Map produced by our method, and the paired normal reference. (b) Performance comparison among w/o Diff, Pixel-level Diff (pixel-wise subtraction), and Ours on BLEU-1 and CE-F1.
  • ...and 3 more figures
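
The local branch described in the Figure 2 caption (token-wise residuals aggregated with distance-based importance weights) can be illustrated with a minimal sketch. This is not the authors' implementation: the caption does not give the exact formula, so the L2 norm as the distance, the softmax weighting, the `temperature` parameter, and the function names are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def distance_weights(residuals: Sequence[Sequence[float]],
                     temperature: float = 1.0) -> Vector:
    """Softmax over per-token L2 residual norms (assumed weighting scheme)."""
    distances = [math.sqrt(sum(x * x for x in res)) for res in residuals]
    peak = max(distances)  # subtract the max for numerical stability
    exps = [math.exp((d - peak) / temperature) for d in distances]
    total = sum(exps)
    return [e / total for e in exps]


def local_delta(target: Sequence[Sequence[float]],
                reference: Sequence[Sequence[float]],
                temperature: float = 1.0) -> Tuple[Vector, Vector]:
    """Aggregate token-wise residuals (I - I^r) into a single difference vector.

    `target` and `reference` stand in for the aligned latent tokens I and I^r,
    each given as N tokens of equal dimension. Returns (delta, weights).
    """
    if not target or len(target) != len(reference):
        raise ValueError("target and reference need the same non-zero token count")
    residuals = [[t - r for t, r in zip(tok, ref)]
                 for tok, ref in zip(target, reference)]
    weights = distance_weights(residuals, temperature)
    dim = len(residuals[0])
    # Weighted sum of residuals: tokens far from the reference dominate.
    delta = [sum(w * res[j] for w, res in zip(weights, residuals))
             for j in range(dim)]
    return delta, weights


if __name__ == "__main__":
    scan = [[0.0, 0.0], [3.0, 4.0]]        # second token deviates from normal
    normal = [[0.0, 0.0], [0.0, 0.0]]
    delta, weights = local_delta(scan, normal)
    print(delta, weights)
```

Under these assumptions, tokens that match the reference contribute almost nothing, which mirrors the caption's idea of suppressing invariant anatomy while emphasizing deviating regions.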