Knowledge-enhanced Visual-Language Pretraining for Computational Pathology

Xiao Zhou, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Weidi Xie, Yanfeng Wang

TL;DR

This paper curates a pathology knowledge tree of 50,470 informative attributes for 4,718 diseases requiring pathology diagnosis across 32 human tissues, which the authors describe as the first comprehensive structured pathology knowledge base, and uses it to develop a knowledge-enhanced visual-language pretraining approach.

Abstract

In this paper, we consider the problem of visual representation learning for computational pathology, by exploiting large-scale image-text pairs gathered from public resources, along with the domain-specific knowledge in pathology. Specifically, we make the following contributions: (i) We curate a pathology knowledge tree that consists of 50,470 informative attributes for 4,718 diseases requiring pathology diagnosis from 32 human tissues. To our knowledge, this is the first comprehensive structured pathology knowledge base; (ii) We develop a knowledge-enhanced visual-language pretraining approach, where we first project pathology-specific knowledge into latent embedding space via a language model, and use it to guide the visual representation learning; (iii) We conduct thorough experiments to validate the effectiveness of our proposed components, demonstrating significant performance improvement on various downstream tasks, including cross-modal retrieval, zero-shot classification on pathology patches, and zero-shot tumor subtyping on whole slide images (WSIs).

Paper Structure (22 sections, 11 equations, 10 figures, 18 tables)

Figures (10)

  • Figure 1: Knowledge-enhanced pathology image-text alignment. Short captions of pathology images crawled from public websites are typically unstructured and vary in granularity, which introduces noticeable ambiguity into image-text alignment. However, the implicit structures and correlations between different image-caption pairs can be recovered from explicit disease attributes (dashed boxes), which can be aligned through a pathology knowledge tree. LUSC and LUAD denote lung squamous cell carcinoma and lung adenocarcinoma, respectively.
  • Figure 2: Construction of the pathology knowledge tree (PathKT). OncoTree is adopted as the base architecture. Tissue types, disease entities, and attributes are first extracted from web-crawled pathological descriptions; cancers are then matched to OncoTree based on their tissue types and tumor types/subtypes using UMLS CUIs. Non-tumor diseases are added to the knowledge tree according to their tissue types. The resulting pathology knowledge tree integrates 4,718 diseases from 32 tissues, with each disease containing various synonyms, definitions, and histological and cytological features. CSCLC in this figure denotes combined small cell lung cancer.
  • Figure 3: Knowledge encoder pretraining based on metric learning. A mini-batch (left part of the figure) consists of $n$ disease entities, each with $k$ attributes, including disease synonyms, definitions, and cytology and pathology features; it is fed to a knowledge encoder for pretraining. In the embedding space (right part of the figure), markers of different shapes represent the embeddings of attributes of different diseases. $S^+_i$ denotes the max-min positive attribute similarity within the $i$-th disease, while $S^-_i$ denotes the maximal attribute similarity between the $i$-th disease and other diseases. The goal of metric learning is to increase $S^+_i$ while decreasing $S^-_i$. The purple-dashed arrow and circle denote the minimal positive attribute similarity in the second class and the hypersphere it spans.
  • Figure 4: Model architecture (left graph). A projection head is added on top of the visual encoder to bridge the gap between the image and text encoders. The knowledge encoder is frozen throughout training to distill pathology knowledge into the learnable text encoder. As a result, pathology images can be aligned with their implicit disease labels (marked by dashed boxes in the right graph) during visual-language pretraining, since the captions contain disease attributes that have already been aligned with disease names/synonyms in the knowledge embedding space.
  • Figure 5: Comparison of zero-shot patch classification between different models. The left and right subfigures show results for pretraining on OpenPath and Quilt1M, respectively. The visual encoders of KEP-32 and KEP-16 are initialized from CLIP (ViT-B-32) and BiomedCLIP (ViT-B-16), respectively. Each box summarizes 100 points, each representing the performance of one text prompt. The edges and center line of each box denote the first and third quartiles and the median of the distribution.
  • ...and 5 more figures
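
The two quantities named in the Figure 3 caption can be made concrete with a small sketch. The code below is an illustration only, not the authors' implementation. It assumes cosine similarity and reads "max-min" as: for each attribute of disease $i$, take its lowest similarity to the other attributes of the same disease, then take the maximum of those values. The function names `cosine` and `disease_similarities` and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def disease_similarities(embeddings, labels, i):
    """Return (S_plus, S_minus) for disease i in one mini-batch.

    Assumes disease i has at least two attributes and that the batch
    contains at least one attribute from another disease.
    """
    own = [e for e, lab in zip(embeddings, labels) if lab == i]
    others = [e for e, lab in zip(embeddings, labels) if lab != i]

    # For every anchor attribute, find its least similar same-disease attribute.
    hardest_positive = []
    for a_idx, anchor in enumerate(own):
        sims = [cosine(anchor, p) for p_idx, p in enumerate(own) if p_idx != a_idx]
        hardest_positive.append(min(sims))

    # "Max-min" (assumed reading): the largest of those per-anchor minima.
    s_plus = max(hardest_positive)

    # Largest similarity between any attribute of disease i and any other disease.
    s_minus = max(cosine(a, n) for a in own for n in others)
    return s_plus, s_minus


# Toy batch: n = 2 diseases, 2-D embeddings chosen by hand for illustration.
toy_embeddings = [(1.0, 0.0), (0.8, 0.6), (0.6, 0.8), (0.0, 1.0), (-0.6, 0.8)]
toy_labels = [0, 0, 0, 1, 1]
s_plus, s_minus = disease_similarities(toy_embeddings, toy_labels, 0)
```

Under this reading, a training objective would push `s_plus` up and `s_minus` down for every disease in the batch; the exact loss used in the paper is not given in this summary.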