Quilt-1M: One Million Image-Text Pairs for Histopathology

Wisdom Oluchi Ikezogwo, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Stefan Chan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, Linda Shapiro

TL;DR

This paper addresses the lack of large-scale, in-domain vision-language data for histopathology by constructing Quilt from $1{,}087$ hours of educational YouTube videos, yielding $437{,}878$ images and $802{,}144$ text pairs, and then expanding to Quilt-1M by integrating PubMed Open Access, LAION-5B, and Twitter/OpenPath data to reach about $1.02$M image-text pairs. The authors train and evaluate QuiltNet, a CLIP-based model finetuned on Quilt-1M, using multiple image and text encoders, and demonstrate strong zero-shot, linear probing, and cross-modal retrieval performance across 13 external histopathology datasets and 8 sub-pathologies, outperforming baselines like CLIP and BiomedCLIP. The work presents a scalable, privacy-conscious data-curation pipeline that leverages ASR, LLMs, and domain ontologies (UMLS) to produce densely annotated image-text pairs with rich medical content. By combining diverse sources and providing a public, large-scale histopathology vision-language resource, Quilt-1M advances representation learning for histopathology and enables more robust cross-modal understanding in this domain.

Abstract

Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of analogous data in the medical field, specifically in histopathology, has slowed comparable progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering $1{,}087$ hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate QUILT: a large-scale vision-language dataset consisting of $802{,}144$ image and text pairs. QUILT was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around $200$K samples. We combine QUILT with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dataset: QUILT-1M, with $1$M paired image-text samples, marking it as the largest vision-language histopathology dataset to date. We demonstrate the value of QUILT-1M by fine-tuning a pre-trained CLIP model. Our model outperforms state-of-the-art models on both zero-shot and linear probing tasks for classifying new histopathology images across $13$ diverse patch-level datasets of $8$ different sub-pathologies and cross-modal retrieval tasks.
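As a rough illustration of what zero-shot classification with a CLIP-style model involves, the sketch below scores one image embedding against one text embedding per class by cosine similarity and picks the best match. It is a minimal sketch, not the authors' evaluation code. The function names, the class prompts, and the three-dimensional toy embeddings are all hypothetical. In practice the embeddings would come from the model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the label whose text embedding is most similar to the image embedding.

    class_text_embeddings maps a label to the embedding of its text prompt.
    """
    scores = {
        label: cosine_similarity(image_embedding, text_embedding)
        for label, text_embedding in class_text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical toy embeddings; real ones would be encoder outputs.
toy_text_embeddings = {
    "tumor": [0.9, 0.1, 0.0],
    "normal tissue": [0.1, 0.9, 0.0],
}
toy_image_embedding = [0.8, 0.2, 0.1]

predicted_label, similarity_scores = zero_shot_classify(
    toy_image_embedding, toy_text_embeddings
)
print(predicted_label, similarity_scores)
```

No labeled training images are needed for this step: the class names themselves, written as text prompts, act as the classifier. This is why a larger and more in-domain image-text corpus can directly affect zero-shot accuracy.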

Paper Structure (24 sections, 1 equation, 14 figures, 11 tables, 1 algorithm)

This paper contains 24 sections, 1 equation, 14 figures, 11 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of the Quilt curation pipeline. We identify relevant histopathology YouTube videos in Search. For Image extraction, we find and de-noise histopathology frames using trained models. In the Text section, we rely on a conventional Automatic Speech Recognition (ASR) model and leverage the Unified Medical Language System (UMLS) and large language models (LLMs) for post-processing and ASR error correction. Relevant sub-pathology, medical, and region-of-interest (ROI) text is extracted using an LLM. Finally, domain-specific algorithms are used to Pair images and text, eliminating duplicates to yield Quilt, a richly annotated image-text dataset for histopathology.
  • Figure 2: Quilt examples. Input is the corrected ASR caption for the representative image. Output are the medical and ROI extracted text(s) paired with the image (see Section \ref{sec:quilt_main}). In histopathology, understanding tissue characteristics often involves views from varying magnification levels. Thus, in Quilt we estimate an image’s magnification (indicated by the relative size of the microscope icon).
  • Figure 3: Prompting the LLM by providing few-shot examples to perform the following tasks: (Left) correcting noisy ASR text within its context. We highlight the probable list of misspelled keywords in yellow and their corrections by the LLM in gray. Additional missed errors/misspelled entries identified by the LLM are highlighted in blue. (Right) extracting medical (MED) and ROI text from a given text. We highlight the definition of medical and ROI text in blue and gray respectively.
  • Figure 4: (a) Distribution of all videos over sub-pathology types. (b) Distribution of our entities across the top 20 UMLS semantic types. (c) Number of image-text pairs within each sub-pathology type. (d) Word cloud of all the text in Quilt.
  • Figure 5: QuiltNet outperforms the out-of-domain CLIP baseline and state-of-the-art histopathology models across 12 zero-shot tasks, covering 8 different sub-pathologies (accuracy percentage provided).
  • ...and 9 more figures