Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions
Ruben T. Lucassen, Sander P. J. Moonemans, Tijn van de Luijtgaarden, Gerben E. Breimer, Willeke A. M. Blokx, Mitko Veta
TL;DR
We address the workload burden of pathology report writing for melanocytic skin lesions by building a domain-specific vision-language pipeline that aligns whole slide image (WSI) tiles with the corresponding pathology reports. The framework follows the Contrastive Captioner (CoCa) paradigm in a PRISM-inspired architecture, using a frozen image encoder (UNI) and a BioGPT-based language model with cross-attention, trained with a combined contrastive ($\mathcal{L}_{\text{Con}}$) and captioning ($\mathcal{L}_{\text{Cap}}$) objective. On a curated dataset of 42,512 WSIs and 19,645 reports, the generated reports reach quality comparable to pathologist-written reports for common nevi but trail for rarer subtypes, while cross-modal retrieval performs better on the non-common lesions. The work demonstrates practical potential to reduce routine reporting time in digital pathology and points to metadata integration and larger datasets as directions for improving generalization and retrieval.
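For readers unfamiliar with the combined objective, it can be sketched in the general form used by the Contrastive Captioner framework. The weights $\lambda_{\text{Con}}$, $\lambda_{\text{Cap}}$ and the temperature $\sigma$ are assumptions for illustration; their values are not stated here. $x_i$ and $y_i$ denote the normalized image and text embeddings of the $i$-th pair in a batch of size $N$, and $y_t$ is the $t$-th report token.

$$
\mathcal{L} = \lambda_{\text{Con}} \, \mathcal{L}_{\text{Con}} + \lambda_{\text{Cap}} \, \mathcal{L}_{\text{Cap}}
$$

$$
\mathcal{L}_{\text{Con}} = -\frac{1}{N} \left( \sum_{i=1}^{N} \log \frac{\exp(x_i^{\top} y_i / \sigma)}{\sum_{j=1}^{N} \exp(x_i^{\top} y_j / \sigma)} + \sum_{i=1}^{N} \log \frac{\exp(y_i^{\top} x_i / \sigma)}{\sum_{j=1}^{N} \exp(y_i^{\top} x_j / \sigma)} \right)
$$

$$
\mathcal{L}_{\text{Cap}} = -\sum_{t=1}^{T} \log P_{\theta}\left(y_t \mid y_{<t}, x\right)
$$

The contrastive term pulls matching image-report pairs together in the shared embedding space, which is what enables cross-modal retrieval, while the captioning term trains the decoder to generate the report text autoregressively from the image features.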
Abstract
Millions of melanocytic skin lesions are examined by pathologists each year, most of which are common nevi (i.e., ordinary moles). While most of these lesions can be diagnosed in seconds, writing the corresponding pathology report is much more time-consuming. Automating part of the report writing could, therefore, alleviate the increasing workload of pathologists. In this work, we develop a vision-language model specifically for the pathology domain of cutaneous melanocytic lesions. The model follows the Contrastive Captioner framework and was trained and evaluated using a melanocytic lesion dataset of 42,512 H&E-stained whole slide images and 19,645 corresponding pathology reports. Our results show that the quality scores of model-generated reports were on par with those of pathologist-written reports for common nevi, as assessed by an expert pathologist in a reader study. While report generation proved more difficult for rare melanocytic lesion subtypes, the cross-modal retrieval performance for these cases was considerably better.
