Navigating Gigapixel Pathology Images with Large Multimodal Models

Thomas A. Buckley, Kian R. Weihrauch, Katherine Latham, Andrew Z. Zhou, Padmini A. Manrai, Arjun K. Manrai

TL;DR

This work tackles the challenge of interpreting gigapixel pathology images with general-purpose large multimodal models. It introduces GIANT, an agent-based framework that iteratively pans, zooms, and reasons across whole-slide images (WSIs), and pairs it with MultiPathQA, a benchmark comprising 934 WSI-level questions across five clinical tasks plus a pathologist-authored ExpertVQA subset. Empirical results show that GPT-5 with GIANT substantially outperforms thumbnail and patch baselines and can rival specialist pathology systems on several tasks, with performance improving as the agent is allowed more iterations. A pathologist evaluation highlights both a strength, the selection of diagnostically relevant regions, and a limitation, difficulty synthesizing visually grounded findings into a final diagnosis. The work also reports ablations that compare deliberate versus random region selection and explore the effect of pathology priors and thumbnail resolution, underscoring GIANT’s potential to broaden the capabilities of foundation models in expert reasoning for pathology and multi-scale image understanding.

Abstract

Despite being widely used to support clinical care, general-purpose large multimodal models (LMMs) have generally shown poor or inconclusive performance in medical image interpretation, particularly in pathology, where gigapixel images are used. However, prior studies have used either low-resolution thumbnails or random patches, which likely underestimated model performance. Here, we ask whether LMMs can be adapted to reason coherently and accurately in the evaluation of such images. In this study, we introduce Gigapixel Image Agent for Navigating Tissue (GIANT), the first framework that allows LMMs to iteratively navigate whole-slide images (WSIs) like a pathologist. Accompanying GIANT, we release MultiPathQA, a new benchmark, which comprises 934 WSI-level questions, encompassing five clinically-relevant tasks ranging from cancer diagnosis to open-ended reasoning. MultiPathQA also includes 128 questions, authored by two professional pathologists, requiring direct slide interpretation. Using MultiPathQA, we show that our simple agentic system substantially outperforms conventional patch- and thumbnail-based baselines, approaching or surpassing the performance of specialized models trained on millions of images. For example, on pathologist-authored questions, GPT-5 with GIANT achieves 62.5% accuracy, outperforming specialist pathology models such as TITAN (43.8%) and SlideChat (37.5%). Our findings reveal the strengths and limitations of current foundation models and ground future development of LMMs for expert reasoning in pathology.
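
The iterative navigation described in the abstract can be pictured as a simple loop: the model starts from a view of the whole slide, repeatedly requests a smaller region to inspect, and eventually returns an answer. The following is only a rough sketch of that idea, not the authors' implementation. The names `Region`, `Answer`, `clamp`, `navigate`, and `toy_policy` are hypothetical, the `policy` callable stands in for the LMM, and actual image reading is omitted. The default budget of 20 iterations follows the figure captions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Region:
    """Hypothetical axis-aligned crop in full-resolution slide pixels."""
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class Answer:
    """Hypothetical final response emitted by the policy."""
    text: str


Action = Union[Region, Answer]
Policy = Callable[[List[Region], int], Action]


def clamp(region: Region, slide_w: int, slide_h: int) -> Region:
    """Clip a requested region so it lies fully inside the slide."""
    w = max(1, min(region.width, slide_w))
    h = max(1, min(region.height, slide_h))
    x = max(0, min(region.x, slide_w - w))
    y = max(0, min(region.y, slide_h - h))
    return Region(x, y, w, h)


def navigate(
    policy: Policy,
    slide_size: Tuple[int, int],
    max_iterations: int = 20,
) -> Tuple[Optional[str], List[Region]]:
    """Run the pan/zoom loop until the policy answers or the budget ends."""
    slide_w, slide_h = slide_size
    # Assumption: the first view is the whole slide (a thumbnail).
    history: List[Region] = [Region(0, 0, slide_w, slide_h)]
    for step in range(max_iterations):
        action = policy(history, step)
        if isinstance(action, Answer):
            return action.text, history
        history.append(clamp(action, slide_w, slide_h))
    # Budget exhausted without a final answer.
    return None, history


def toy_policy(history: List[Region], step: int) -> Action:
    """Stand-in for the LMM: zoom into the centre twice, then answer."""
    if step >= 2:
        return Answer("benign")
    last = history[-1]
    return Region(
        last.x + last.width // 4,
        last.y + last.height // 4,
        last.width // 2,
        last.height // 2,
    )


answer, history = navigate(toy_policy, slide_size=(100_000, 80_000))
```

In a real system the policy would receive the rendered crop for each region rather than coordinates alone. Keeping the policy as a plain callable makes the loop easy to test with deterministic stand-ins such as `toy_policy`.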

Paper Structure

This paper contains 27 sections, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: In this study, we introduce Gigapixel Image Agent for Navigating Tissue (GIANT), a framework that enables general-purpose LMMs to iteratively pan, zoom, and reason across a gigapixel pathology image. We compare this to prior approaches, which either provide a low-resolution thumbnail to the model or provide multiple patches of the image to models. We show that our approach dramatically outperforms prior simple baselines using GPT-5, and surpasses specialized pathology chat models trained on thousands of whole-slide images.
  • Figure 2: Bar plot comparison of GPT-5 with the GIANT framework versus leading foundation models trained on whole-slide pathology images. We compare against top-performing models for zero-shot classification (TITAN) and slide-level question answering (SlideChat), as well as prior baseline methods [chen2025slidechat]. In the thumbnail setting, the model receives a 1024×1024 px image; in the patch setting, it receives 30 segmented 224×224 px patches (via the CLAM pipeline [clam]) aggregated by majority vote. GIANT ×5 denotes five independent runs with majority voting.
  • Figure 3: A. We introduce MultiPathQA, a benchmark suite for evaluating LMMs on whole-slide question answering, assessing both diagnostic performance and clinically relevant question-answering capabilities. It uses publicly available WSIs from TCGA, GTEx, and the PANDA Challenge. In our study, two pathologists (one resident, one attending) generated 128 questions after manual review of 96 TCGA slides. These questions are designed to require direct image interpretation at multiple scales. B. Examples of the five tasks in MultiPathQA. C. To evaluate GIANT's reasoning, a board-certified pathologist reviewed 28 model reasoning traces where GPT-5 diagnosed benign or cancer cases.
  • Figure 4: Examples of GIANT reasoning traces for cancer diagnosis and multiple-choice question-answering. The model is instructed to produce its final response after 20 iterations, which can be modified using a system prompt. See Supplementary Material for prompts.
  • Figure 5: Performance scaling of GIANT (GPT-5) with the maximum number of allowed iterations. We use a system prompt to enforce that the model provide its final response after a specific number of iterations, marking a trial incorrect if the model exceeds this limit after 3 retries.
  • ...and 2 more figures
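
The Figure 2 caption mentions majority voting in two places: across 30 patch-level predictions in the patch baseline, and across five independent runs for GIANT ×5. A minimal sketch of such an aggregation step is shown below. The function name `majority_vote` is hypothetical, and the tie-breaking rule (the earliest prediction among the tied labels wins) is an assumption, since the captions do not say how ties are resolved.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(predictions: Sequence[Hashable]) -> Hashable:
    """Return the most frequent prediction.

    Assumption: ties go to the tied label that appears first in the input.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label
    raise AssertionError("unreachable: some label must reach the top count")


# Five hypothetical run-level answers for a single slide.
runs = ["cancer", "benign", "cancer", "cancer", "benign"]
final = majority_vote(runs)
```

With an odd number of runs and two classes, as in this example, a tie cannot occur. The tie-breaking rule only matters for multi-class tasks or even-sized vote pools.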