Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays

Kang Liu, Zhuoqi Ma, Siyu Liang, Yunan Li, Xiyue Gao, Chao Liang, Kun Xie, Qiguang Miao

Abstract

Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists' gaze -- a crucial cue for visual reasoning -- remains largely underexplored. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context -- including patient history, symptoms, and diagnostic intent -- to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists' gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at https://github.com/mk-runner/CoGaze.
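
To make the supervision paradigm above more concrete, the following is a minimal, self-contained sketch in PyTorch of two of the ingredients the abstract describes: a plain symmetric image-report contrastive term (not the hybrid-positive variant used by CoGaze) and a gaze-prior attention loss that penalizes divergence between patch-level attention and a radiologist gaze heatmap. Function names, tensor shapes, and the KL formulation are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def gaze_attention_loss(attn_logits, gaze_heatmap):
    # attn_logits:  (B, N) unnormalized attention scores over N image patches
    # gaze_heatmap: (B, N) non-negative gaze density pooled onto the same patch grid
    attn_log_prob = F.log_softmax(attn_logits, dim=-1)
    gaze_prob = gaze_heatmap / gaze_heatmap.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    # KL(gaze || attention): treat the gaze map as a probabilistic prior on salient regions
    return F.kl_div(attn_log_prob, gaze_prob, reduction="batchmean")

def image_report_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) pooled image and report embeddings
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # symmetric InfoNCE over matched image-report pairs in the batch
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

In a pretraining loop such terms would be combined as a weighted sum, with the gaze term applied only to the subset of cases for which heatmaps exist, consistent with the note in Figure 2 that gaze supervision is used only during pretraining.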

Paper Structure

This paper contains 31 sections, 9 equations, 15 figures, and 13 tables.

Figures (15)

  • Figure 1: Overview of the pretraining framework. (a) Image-report pairs with optional clinical context. (b) Heatmap-transcript pairs capturing physicians' visual attention. (c-d) Comparison between existing methods and ours. (e) Image-text retrieval results on the MIMIC-5x200 dataset [zhou2024benchx]. Clinical context and gaze data are available for a subset of cases. "TS" denotes transcript.
  • Figure 2: (A) Overview of our CoGaze. (B) Context-infused vision encoder. Gaze supervision is used only during pretraining.
  • Figure 3: Workflow of downstream tasks. All tasks are performed without requiring gaze heatmaps.
  • Figure 4: Zero-shot image-text retrieval on MIMIC-5x200 dataset. P@K: precision at top-K; R@K: recall at top-K.
  • Figure 5: Comparison of supervised classification performance in terms of AUROC on the Shenzhen dataset [2023-shenzhencxr].
  • ...and 10 more figures