Unleashing Video Language Models for Fine-grained HRCT Report Generation

Yingying Fang; Huichi Zhou; KinHei Lee; Yijia Wang; Zhenxuan Zhang; Jiahao Huang; Guang Yang

Abstract

Generating precise diagnostic reports from High-Resolution Computed Tomography (HRCT) is critical to clinical workflows, yet it remains a formidable challenge because of the high pathological diversity and spatial sparsity of abnormalities within 3D volumes. While Video Language Models (VideoLMs) have demonstrated remarkable spatio-temporal reasoning in general domains, their adaptability to domain-specific, high-volume medical interpretation remains underexplored. In this work, we present AbSteering, an abnormality-centric framework that steers VideoLMs toward precise HRCT report generation. Specifically, AbSteering introduces: (i) an abnormality-centric Chain-of-Thought scheme that enforces abnormality reasoning, and (ii) a Direct Preference Optimization objective that uses clinically confusable abnormalities as hard negatives to enhance fine-grained discrimination. Our results demonstrate that general-purpose VideoLMs transfer well to high-volume medical imaging when guided by this paradigm. Notably, AbSteering outperforms state-of-the-art domain-specific CT foundation models, which are pretrained on large-scale CT data, achieving superior detection sensitivity while simultaneously mitigating hallucinations. Our data and model weights are released at https://anonymous.4open.science/r/hrct-report-generation-video-vlm-728C/
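To make the preference objective in (ii) concrete, the following is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair. It is an illustration only: the function name `dpo_loss`, its arguments, and the default `beta` are assumptions made here, and the exact objective used by AbSteering may differ. In this setting the "chosen" sequence would be the reference report and the "rejected" sequence a report in which an abnormality has been swapped for a clinically confusable one.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are summed token log-probabilities of a full report.
    The names and the default beta are illustrative assumptions,
    not values taken from the paper.
    """
    # Implicit rewards: how much the policy has moved away from the
    # frozen reference model on each report.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a
    # numerically stable way for both signs of the margin.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and the reference agree, and it falls as the policy assigns relatively more probability to the reference report than to the hard negative.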

Paper Structure (8 sections, 3 equations, 3 figures, 1 table)

Introduction
Method
Architecture of Video-Language Models
Abnormality-centric chain-of-thoughts
Fine-grained abnormality discrimination
Experimental details
Experimental results
Conclusion

Figures (3)

Figure 1: Left: overall framework for adapting current VideoLMs with the proposed AbSteering in two stages: (a) the typical backbone of current VideoLMs; (b) Stage 1, abnormality-centric CoT training; and (c) Stage 2, fine-grained abnormality discrimination via DPO. Right: examples of abnormality-centric and manipulated report samples.
Figure 2: Case Study: Comparison of generated reports. Red text denotes ground-truth abnormalities, while green indicates clinical findings absent from the original reports. Total counts for matched and overpredicted findings are provided.
Figure 3: Ablation study on the (a) AbSteering strategy, (b) visual encoder architectures, and (c) LLM parameter scales.
