Navigating Gigapixel Pathology Images with Large Multimodal Models

Thomas A. Buckley; Kian R. Weihrauch; Katherine Latham; Andrew Z. Zhou; Padmini A. Manrai; Arjun K. Manrai

TL;DR

This work tackles the challenge of interpreting gigapixel pathology images with general-purpose large multimodal models. It introduces GIANT, an agent-based framework that iteratively pans, zooms, and reasons across whole-slide images (WSIs), and pairs it with MultiPathQA, a benchmark comprising 934 WSI-level questions across five clinical tasks plus a pathologist-authored ExpertVQA subset. Empirical results show that GPT-5 with GIANT substantially outperforms thumbnail and patch baselines and can rival specialist pathology systems on several tasks, with performance improving as the agent makes more iterations. A pathologist evaluation highlights both a strength (the agent selects diagnostically relevant regions) and a limitation (it struggles to synthesize visually grounded findings into a final diagnosis). The work also reports ablations that compare deliberate with random region selection and explore integrating pathology priors and varying thumbnail resolution, underscoring GIANT's potential to broaden the capabilities of foundation models in expert reasoning for pathology and multi-scale image understanding.

Abstract

Despite being widely used to support clinical care, general-purpose large multimodal models (LMMs) have generally shown poor or inconclusive performance in medical image interpretation, particularly in pathology, where gigapixel images are used. However, prior studies have used either low-resolution thumbnails or random patches, which likely underestimated model performance. Here, we ask whether LMMs can be adapted to reason coherently and accurately in the evaluation of such images. In this study, we introduce Gigapixel Image Agent for Navigating Tissue (GIANT), the first framework that allows LMMs to iteratively navigate whole-slide images (WSIs) like a pathologist. Accompanying GIANT, we release MultiPathQA, a new benchmark, which comprises 934 WSI-level questions, encompassing five clinically-relevant tasks ranging from cancer diagnosis to open-ended reasoning. MultiPathQA also includes 128 questions, authored by two professional pathologists, requiring direct slide interpretation. Using MultiPathQA, we show that our simple agentic system substantially outperforms conventional patch- and thumbnail-based baselines, approaching or surpassing the performance of specialized models trained on millions of images. For example, on pathologist-authored questions, GPT-5 with GIANT achieves 62.5% accuracy, outperforming specialist pathology models such as TITAN (43.8%) and SlideChat (37.5%). Our findings reveal the strengths and limitations of current foundation models and ground future development of LMMs for expert reasoning in pathology.
