See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis
Ruinan Jin, Gexin Huang, Xinwei Shen, Qiong Zhang, Yan Shuo Tan, Xiaoxiao Li
TL;DR
Medical image diagnosis is difficult because findings are often subtle and localized and vary considerably across patients. This work introduces See-in-Pairs (SiP), a reference-image-guided framework that enables cross-subject comparative reasoning in vision-language models (VLMs). SiP pairs a query image with healthy-control references and uses structured prompts, plus a lightweight comparative supervised fine-tuning (SFT) strategy, to inject clinically relevant comparison capabilities into open VLMs. The authors systematically evaluate zero-shot (off-the-shelf) and SFT scenarios across six medical datasets spanning radiology, ophthalmology, and dermatology, and show robust gains across reference-selection strategies (random, demographic, embedding-based, cross-center) and reference scales. They also provide mechanistic and interpretability evidence, including attribution analyses, showing that comparison sharpens pathology-focused representations and improves sample efficiency. Overall, SiP offers a practical, scalable path to clinically aligned, reference-guided medical VLMs, with improvements in accuracy and robustness across diverse tasks.
Abstract
Medical image diagnosis is challenging because many diseases resemble normal anatomy and exhibit substantial inter-patient variability. Clinicians routinely rely on comparative diagnosis, such as referencing healthy control images from other patients to identify subtle but clinically meaningful abnormalities. Although healthy reference images are abundant in practice, existing medical vision-language models (VLMs) primarily operate in a single-image or single-series setting and lack explicit mechanisms for comparative diagnosis. This work investigates whether incorporating clinically motivated comparison can enhance VLM performance. We show that providing VLMs with both a query image and a matched healthy reference image, accompanied by cross-patient comparative prompts, significantly improves diagnostic performance. Performance improves further with lightweight supervised fine-tuning (SFT) on a small amount of data. We also evaluate multiple strategies for selecting reference images, including random sampling, demographic attribute matching, embedding-based retrieval, and cross-center selection, and find consistently strong performance across all settings. Finally, we investigate theoretically why comparative diagnosis is effective, and we observe improved sample efficiency and tighter alignment between visual and textual representations. Our findings highlight the clinical relevance of comparison-based diagnosis, provide practical strategies for incorporating reference images into VLMs, and demonstrate improved performance across diverse medical imaging tasks.
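
As a rough illustration of two of the reference-selection strategies named above (random sampling and embedding-based retrieval), the following minimal Python sketch picks one healthy reference for a query image and assembles a comparative prompt. It is not the authors' implementation. The function names `select_reference` and `build_comparative_prompt`, the use of cosine similarity, and the prompt wording are all assumptions made for illustration, and the embeddings are assumed to be precomputed, non-zero vectors from some image encoder.

```python
import numpy as np


def select_reference(query_emb, ref_embs, strategy="embedding", rng=None):
    """Return the index of one healthy reference image for a query image.

    Hypothetical helper, not taken from the paper.
    strategy="random":    sample uniformly from the healthy pool.
    strategy="embedding": pick the healthy image whose embedding has the
                          highest cosine similarity to the query embedding.
    """
    ref_embs = np.asarray(ref_embs, dtype=float)
    if strategy == "random":
        rng = rng if rng is not None else np.random.default_rng()
        return int(rng.integers(len(ref_embs)))
    if strategy == "embedding":
        q = np.asarray(query_emb, dtype=float)
        q = q / np.linalg.norm(q)  # assumes a non-zero query embedding
        r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
        return int(np.argmax(r @ q))
    raise ValueError(f"unknown strategy: {strategy}")


def build_comparative_prompt(finding):
    """Assemble an illustrative cross-patient comparative prompt.

    The wording is an assumption, not the prompt used in the paper.
    """
    return (
        "Image 1 is a healthy control from a different patient. "
        "Image 2 is the query image. "
        f"Compare the two images and state whether Image 2 shows {finding}. "
        "Answer 'yes' or 'no'."
    )


# Toy example with 2-D embeddings standing in for real encoder outputs.
healthy_pool = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
idx = select_reference([0.6, 0.8], healthy_pool, strategy="embedding")
prompt = build_comparative_prompt("signs of pneumonia")
```

In this sketch the selected reference image and the query image would be passed to the VLM together with `prompt`; demographic matching and cross-center selection would replace only the selection step.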
