See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis

Ruinan Jin, Gexin Huang, Xinwei Shen, Qiong Zhang, Yan Shuo Tan, Xiaoxiao Li

TL;DR

This work addresses the difficulty of medical image diagnosis, which stems from subtle, localized findings and inter-patient variability, by introducing See-in-Pairs (SiP), a reference-image-guided framework that enables cross-subject comparative reasoning in vision-language models. SiP pairs a query image with healthy-control references under structured prompts and adds a lightweight comparative supervised fine-tuning (SFT) strategy to inject clinically relevant comparison capabilities into open VLMs. The authors systematically evaluate zero-shot (off-the-shelf) and SFT scenarios across six medical datasets spanning radiology, ophthalmology, and dermatology, and show robust gains across reference-selection strategies (random, demographic, embedding-based, cross-center) and reference scales. They provide mechanistic and interpretability evidence, including attribution analyses, showing that comparison sharpens pathology-focused representations and improves sample efficiency. Overall, SiP offers a practical, scalable path to clinically aligned, reference-guided medical VLMs with demonstrated improvements in accuracy and robustness across diverse tasks.

Abstract

Medical image diagnosis is challenging because many diseases resemble normal anatomy and exhibit substantial interpatient variability. Clinicians routinely rely on comparative diagnosis, such as referencing cross-patient healthy control images to identify subtle but clinically meaningful abnormalities. Although healthy reference images are abundant in practice, existing medical vision-language models (VLMs) primarily operate in a single-image or single-series setting and lack explicit mechanisms for comparative diagnosis. This work investigates whether incorporating clinically motivated comparison can enhance VLM performance. We show that providing VLMs with both a query image and a matched healthy reference image, accompanied by cross-patient comparative prompts, significantly improves diagnostic performance. This performance can be further augmented by lightweight supervised fine-tuning (SFT) on a small amount of data. At the same time, we evaluate multiple strategies for selecting reference images, including random sampling, demographic attribute matching, embedding-based retrieval, and cross-center selection, and find consistently strong performance across all settings. Finally, we investigate why comparative diagnosis is effective theoretically, and observe improved sample efficiency and tighter alignment between visual and textual representations. Our findings highlight the clinical relevance of comparison-based diagnosis, provide practical strategies for incorporating reference images into VLMs, and demonstrate improved performance across diverse medical imaging tasks.
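The reference-selection strategies named in the abstract (random sampling, demographic attribute matching, embedding-based retrieval) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the `HealthyReference` record, its fields, the nearest-age and cosine-similarity rules, and the wording of the comparative prompt are all hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class HealthyReference:
    """Hypothetical record for one healthy-control image (fields are assumptions)."""
    image_id: str
    age: int
    sex: str
    embedding: Tuple[float, ...]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_random(pool: Sequence[HealthyReference], rng: random.Random) -> HealthyReference:
    """Random sampling: any healthy control may serve as the reference."""
    return rng.choice(list(pool))


def select_by_demographics(pool: Sequence[HealthyReference], age: int, sex: str) -> HealthyReference:
    """Demographic matching (assumed rule): same sex if available, then nearest age."""
    candidates = [ref for ref in pool if ref.sex == sex] or list(pool)
    return min(candidates, key=lambda ref: abs(ref.age - age))


def select_by_embedding(pool: Sequence[HealthyReference], query_embedding: Sequence[float]) -> HealthyReference:
    """Embedding-based retrieval (assumed rule): highest cosine similarity to the query."""
    return max(pool, key=lambda ref: cosine_similarity(ref.embedding, query_embedding))


def build_comparative_prompt(finding: str) -> str:
    """Hypothetical comparative prompt; the paper's actual wording is not shown here."""
    return (
        "The first image is the query. The second image is a healthy-control "
        "reference from a different patient. Compare the two images and answer: "
        f"does the query image show {finding}? Answer yes or no."
    )
```

A selected reference would then be passed to the VLM alongside the query image and the prompt; cross-center selection would draw the pool from a different site, which this sketch does not model.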

Paper Structure

This paper contains 32 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Clinical motivation of this paper: clinicians leverage healthy-control reference images to aid diagnosis, yet existing datasets are dominated by healthy images.
  • Figure 2: Overview of our study pipeline. (a) $\mathtt{SiP}$ method overview: we first explored the off-the-shelf inference performance when leveraging healthy-control reference images and then conducted comparative SFT by constructing (query, reference, label) triples and fine-tuning the VLMs (e.g., Qwen, NVILA, Phi-3). We additionally study several clinically inspired reference-selection strategies, including random sampling, demographic matching, embedding-based retrieval, and cross-center sampling. (b) Analysis: pairing query images with reference images reduces nuisance variation and improves feature alignment, enabling more reliable diagnostic performance compared to a single-image setting.
  • Figure 3: Impact of the number of reference images. Each column corresponds to one diagnostic task (Pneumonia, Edema, Glaucoma, Melanoma, Retinopathy). The $x$-axis shows the ratio $K$ of reference images to query images ($5\times$, $10\times$, $20\times$, $30\times$). The top row reports balanced accuracy (BAcc.) and the bottom row reports F1.
  • Figure 4: Attribution maps for single-image inference vs. SiP. We visualize pixel-level attributions on the query image for three representative example sets (Cases 1–3) across three modalities (top to bottom: chest X-ray, fundus photography, dermoscopy). For each query image (Query), we overlay occlusion-sensitivity attributions for a single-image SFT model (Single Attr.) and a $\mathtt{SiP}$ comparison SFT model (SiP Attr.). For $\mathtt{SiP}$, the model receives the query together with a matched healthy-control reference image and an explicit comparison instruction. Color encodes attribution strength: warmer colors (yellow/red) indicate higher attribution magnitude (greater sensitivity to occlusion), while cooler colors (blue/green) indicate weaker attribution.
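The Figure 4 caption describes attribution strength as sensitivity to occlusion. A minimal sketch of that general idea follows, assuming a patch-wise occlusion scheme: each patch of the query image is masked in turn, and the drop in the model's score is recorded as that patch's attribution. The function name, the patch size, the fill value, and the `score_fn` callable (a stand-in for a VLM's confidence in the diagnosis) are all hypothetical; the paper's actual attribution procedure may differ.

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def occlusion_sensitivity(
    image: Image,
    score_fn: Callable[[Image], float],
    patch: int = 2,
    fill: float = 0.0,
) -> Image:
    """Return a heatmap where each pixel holds the score drop caused by
    occluding the patch that contains it (larger drop = stronger attribution)."""
    height, width = len(image), len(image[0])
    baseline = score_fn(image)
    heatmap = [[0.0] * width for _ in range(height)]

    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))

            # Copy the image and mask out the current patch.
            occluded = [row[:] for row in image]
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill

            # Attribution for this patch is how much the score falls.
            drop = baseline - score_fn(occluded)
            for r in rows:
                for c in cols:
                    heatmap[r][c] = drop

    return heatmap
```

In a comparative setting, `score_fn` would presumably hold the reference image and prompt fixed and vary only the occluded query, so that the heatmap reflects the query regions the model depends on.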