Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations

Yuheng Wang; Yuji Lin; Dongrun Zhu; Jiayue Cai; Sunil Kalia; Harvey Lui; Chunqi Chang; Z. Jane Wang; Tim K. Lee

TL;DR

The paper proposes a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images, enabling efficient access to relevant medical records and supporting practical clinical deployment.

Abstract

Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image-text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
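
The convex weighting of local and global evidence described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the softmax-normalized masks, the mean over per-mask cosine similarities, and the weight `alpha = 0.7` are all assumptions made for illustration, since the abstract does not give the exact formulation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps avoids division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def masked_local_pool(patch_features, mask_logits):
    """Pool patch features into one descriptor per spatial attention mask.

    patch_features: (N, D) array, one D-dim feature per spatial location.
    mask_logits:    (M, N) array, unnormalized scores for M masks (hypothetical input).
    Returns an (M, D) array of local descriptors.
    """
    # Softmax over spatial locations so each mask is a convex weighting of patches.
    shifted = mask_logits - mask_logits.max(axis=1, keepdims=True)
    masks = np.exp(shifted)
    masks /= masks.sum(axis=1, keepdims=True)
    return masks @ patch_features


def joint_similarity(q_global, q_local, t_global, t_local, alpha=0.7):
    """Convex combination of local and global cosine similarity.

    alpha weights the local term and (1 - alpha) the global term;
    0.7 is an illustrative value, not one reported in the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    s_global = float(l2_normalize(q_global) @ l2_normalize(t_global))
    # Assumption: compare mask k of the query with mask k of the target, then average.
    per_mask = np.sum(l2_normalize(q_local) * l2_normalize(t_local), axis=1)
    s_local = float(per_mask.mean())
    return alpha * s_local + (1.0 - alpha) * s_global
```

Because the weights are non-negative and sum to one, the combined score always lies between the local and global similarities, so neither term can push the ranking outside the range the two signals agree on.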

Paper Structure (14 sections, 14 equations, 4 figures, 2 tables)

Introduction
Method
Problem formulation
Overview
Hierarchical visual encoding
Text encoding and cross-modal composition
Joint alignment of multi-level representations
Experiments
Benchmark dataset
Evaluation metrics
Results and Discussion
Quantitative results
Qualitative analysis
Conclusion

Figures (4)

Figure 1: Types of clinical skin cancer case search and query. Conventional retrieval uses image-only queries or text-only clinical descriptors separately, whereas our composed retrieval pairs a reference lesion image with its associated text to form a vision-language query. All settings aim to retrieve visually similar, biopsy-confirmed cases from an image-only database to support clinical decision making.
Figure 2: Overall workflow of composed vision-language retrieval for skin cancer. The query image and text are fused via cross-modal Transformers on top of a hierarchical vision backbone to form multi-level composed query representations (a), while each database image is encoded by the same backbone (b). Multi-level global and local alignment jointly compute query-target similarity for retrieval ranking (c).
Figure 3: Meta report template used to construct textual queries.
Figure 4: Qualitative composed retrieval examples on Derm7pt. Each row shows one query (left: query image; middle: associated clinical text) and the top-5 retrieved images (right, R@1–R@5). Rows 1–3 correspond to mel (red borders), bkl (yellow borders), and nevus (blue borders), respectively.
