Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos

Mehmet Saygin Seyfioglu; Wisdom O. Ikezogwo; Fatemeh Ghezloo; Ranjay Krishna; Linda Shapiro

TL;DR

Quilt-LLaVA addresses the lack of spatial grounding in histopathology multimodal models by extracting localized narratives from open-source educational videos and grounding them to WSI patches. The authors introduce Quilt-Instruct, a large grounding QA dataset, and train Quilt-LLaVA in a two-stage process, enabling reasoning across patches and beyond single-image views. They also propose Quilt-VQA, a video-derived evaluation suite, to test diagnosis-oriented and instruction-following capabilities, achieving state-of-the-art results on open and closed histopathology VQA benchmarks. While promising, the work acknowledges limitations from data noise, potential LLM hallucinations, and language restrictions, outlining avenues for broader medical applications and more robust grounding in future work.

Abstract

Diagnosis in histopathology requires a global analysis of whole slide images (WSIs), in which pathologists compound evidence from different WSI patches. The gigapixel scale of WSIs poses a challenge for histopathology multi-modal models. Training multi-modal models for histopathology requires instruction-tuning datasets, which currently contain information for individual image patches only, without spatial grounding of the concepts within each patch and without a wider view of the WSI. Models trained on them therefore lack sufficient diagnostic capacity for histopathology. To bridge this gap, we introduce Quilt-Instruct, a large-scale dataset of 107,131 histopathology-specific instruction question/answer pairs, grounded within the diagnostically relevant image patches that make up the WSI. Our dataset is collected from educational histopathology videos on YouTube, which provide spatial localization of narrations through automatic extraction of the narrators' cursor positions. Quilt-Instruct supports contextual reasoning by extracting the diagnosis and supporting facts from the entire WSI. Using Quilt-Instruct, we train Quilt-LLaVA, which can reason beyond the given single image patch, enabling diagnostic reasoning across patches. To evaluate Quilt-LLaVA, we propose a comprehensive evaluation dataset created from 985 images and 1,283 human-generated question-answer pairs. We also thoroughly evaluate Quilt-LLaVA on public histopathology datasets, where it significantly outperforms SOTA by over 10% on relative GPT-4 score and by 4% and 9% on open-set and closed-set VQA, respectively. Our code, data, and model are publicly accessible at quilt-llava.github.io.

Paper Structure (28 sections, 21 figures, 13 tables, 1 algorithm)

This paper contains 28 sections, 21 figures, 13 tables, 1 algorithm.

Introduction
Related work
Curating Quilt-Instruct
Data preparation
Generating Quilt-Instruct
Independent Prompts
Reasoning-based Prompts
Training Quilt-LLaVA & evaluating with Quilt-VQA
Training Quilt-LLaVA
Evaluation Data Generation: Quilt-VQA
Evaluation data generation: Instruction Following Test Set
Experiments
Conclusion and Limitations
Quilt-Instruct and Quilt-VQA
Total Cost
...and 13 more sections

Figures (21)

Figure 1: Quilt-LLaVA is capable of describing the prominent medical regions within a histopathology patch. Additionally, it can be utilized to reason towards a diagnosis based on the current observations. Note: The image includes eosinophils and lymphocytes, and is sampled from a WSI showing rare benign dermatitis.
Figure 2: To create Quilt-Instruct, we first identify stable chunks within the video. For each chunk, we compute a median frame in the pixel domain and subtract it from every frame within the chunk. We then apply a threshold to reduce noise and take the maximum value to capture the mouse cursor points. These cursor points are then clustered to localize medical content in image captions. Please note that color encodes time in the "Trace Clustering and Mapping" part of the figure.
Figure 3: Quilt-LLaVA was initialized with the general-domain LLaVA and trained for two stages: Histopathology Domain Alignment on Quilt and instruction-tuning on Quilt-Instruct. We evaluated Quilt-LLaVA on visual conversation and question answering tasks.
Figure 5: The GPT-4 prompt used to generate conversational instruction-following data.
Figure 6: The GPT-4 prompt used to generate detailed description instruction-following data.
...and 16 more figures
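
The cursor-extraction step described in the Figure 2 caption (median frame, subtraction, thresholding, maximum) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extract_cursor_points`, the default threshold, the grayscale input format, and the use of an absolute difference are all assumptions made for illustration, and the subsequent clustering of cursor points is omitted because the caption does not specify the method.

```python
import numpy as np

def extract_cursor_points(frames, threshold=30.0):
    """Locate a moving cursor in a stable video chunk (hypothetical helper).

    frames: array of shape (T, H, W), grayscale frames of one stable chunk.
    Returns a list of (frame_index, x, y) for frames where a cursor is found.
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Median frame in the pixel domain approximates the static background.
    median_frame = np.median(frames, axis=0)
    points = []
    for t, frame in enumerate(frames):
        # Subtract the background (absolute value is an assumption).
        diff = np.abs(frame - median_frame)
        # Threshold to suppress small differences such as compression noise.
        diff[diff < threshold] = 0.0
        if diff.max() == 0.0:
            continue  # nothing moved in this frame
        # The strongest remaining response is taken as the cursor position.
        y, x = np.unravel_index(np.argmax(diff), diff.shape)
        points.append((t, int(x), int(y)))
    return points
```

On a chunk whose background is static, a pixel covered by the cursor in only a minority of frames keeps its background value in the median, so the subtraction isolates the cursor. Real footage would likely need a tuned threshold and some handling of cursors that stay still for most of the chunk.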
