HOMIE: Histopathology Omni-modal Embedding for Pathology Composed Retrieval

Qifeng Zhou, Wenliang Zhong, Thao M. Dang, Hehuan Ma, Saiyang Na, Yuzhi Guo, Junzhou Huang

TL;DR

This work defines Pathology Composed Retrieval (PCR) as retrieving evidence from omni-modal clinical data using interleaved queries, addressing the limitations of dual-encoder, low-resolution pathology models and the absence of a suitable benchmark. It introduces HOMIE, a two-stage adaptation framework that first tailors a multimodal LLM for retrieval via text-only pre-training with LoRA, then performs pathology-specific tuning with native-resolution inputs, stain augmentation, and a progressive knowledge curriculum to bridge domain gaps, all trained on public data. A dedicated PCR Benchmark evaluates composed retrieval across multi-image, image-text, and video-text modalities, revealing that HOMIE achieves state-of-the-art performance on traditional retrieval tasks and significantly outperforms baselines on PCR tasks. The results demonstrate that a unified omni-modal embedding enables a transparent, evidence-grounded computational consult in pathology, with potential extensions to incorporate genomics and other omics data for more comprehensive clinical decision support.

Abstract

The integration of Artificial Intelligence (AI) into pathology faces a fundamental challenge: black-box predictive models lack transparency, while generative approaches risk clinical hallucination. A case-based retrieval paradigm offers a more interpretable alternative for clinical adoption. However, current SOTA models are constrained by dual-encoder architectures that cannot process the composed modality of real-world clinical queries. We formally define the task of Pathology Composed Retrieval (PCR). Progress on this newly defined task is blocked by two critical challenges: (1) Multimodal Large Language Models (MLLMs) offer the necessary deep-fusion architecture but suffer from both a Task Mismatch and a Domain Mismatch; (2) no benchmark exists to evaluate such compositional queries. To solve these challenges, we propose HOMIE, a systematic framework that transforms a general MLLM into a specialized retrieval expert. HOMIE resolves the dual mismatch via a two-stage process: a retrieval-adaptation stage to solve the task mismatch, and a pathology-specific tuning stage, featuring a progressive knowledge curriculum, pathology-specific stain and native-resolution processing, to solve the domain mismatch. We also introduce the PCR Benchmark, designed to evaluate composed retrieval in pathology. Experiments show that HOMIE, trained only on public data, matches SOTA performance on traditional retrieval tasks and outperforms all baselines on the newly defined PCR task.
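
To make the retrieval paradigm concrete, the following is a minimal sketch of the final ranking step only, assuming the query and every database item have already been mapped to fixed-length embeddings. The function names `cosine` and `retrieve`, the toy 3-dimensional vectors, and the use of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, candidate_embs, k=3):
    """Return the indices of the k candidates most similar to the query."""
    scored = [(cosine(query_emb, c), i) for i, c in enumerate(candidate_embs)]
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]

# Hypothetical toy embeddings standing in for the output of an embedding model.
query = [1.0, 0.0, 1.0]
database = [
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [1.0, 0.1, 0.9],    # close to the query
    [-1.0, 0.0, -1.0],  # opposite direction
]
top = retrieve(query, database, k=2)
```

In a composed-retrieval setting the query embedding would summarize interleaved text, images, or video rather than a single modality; the ranking step itself would stay the same under these assumptions.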

Paper Structure

This paper contains 15 sections, 1 equation, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Left: HOMIE is designed to function as a "computational consult". Given a composed query containing interleaved multi-modal data (e.g., text, multiple images, video) from a patient case, HOMIE generates an omni-modal embedding to retrieve the most relevant evidence (e.g., similar historical cases) from a large database. This retrieved evidence is then presented to doctors and pathologists, enabling expert-guided clinical decision support. Right: Performance comparison shows HOMIE's superior performance across a range of tasks. For instance, $q^t \leftrightarrow c^i$ and Tile Cls. represent image-text retrieval and tile classification, respectively.
  • Figure 2: Overview of the HOMIE framework. The model ingests arbitrary modalities (e.g., image, text, video) and leverages a prompt-guided LLM to generate a unified omni-modal embedding. The vision pathway is optimized for pathology data, featuring pathology-specific stain normalization/augmentation and native resolution input. We employ a two-stage contrastive learning strategy to train the model, addressing task mismatch followed by domain mismatch.
  • Figure 3: Qualitative comparison on Composed Retrieval tasks. (Top) Image-Text to Image retrieval and (Bottom) Image-Text to Text retrieval. HOMIE is compared against the top-performing baselines from our benchmark (Conch, Musk, and Pathgen-CLIP).
  • Figure 4: UMAP visualization of the modality gap. We plot image (red) and text (blue) embeddings from the EduContent dataset for top baselines and our method. $\|\Delta\|_{\text{gap}}$ denotes the modality gap metric (Liang et al., 2022).
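
The modality gap reported in the Figure 4 caption can be illustrated with a short sketch. It assumes the commonly used centroid formulation: L2-normalize every embedding, average the image embeddings and the text embeddings separately, and take the Euclidean norm of the difference between the two centroids. Whether the paper computes the metric in exactly this way is an assumption, and the helper names and toy vectors are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(col) / count for col in zip(*vectors)]

def modality_gap(image_embs, text_embs):
    """Euclidean distance between the image and text centroids (assumed definition)."""
    img_c = centroid([l2_normalize(v) for v in image_embs])
    txt_c = centroid([l2_normalize(v) for v in text_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(img_c, txt_c)))

# Hypothetical 2-d embeddings: images lie along one axis, texts along the other.
image_embs = [[1.0, 0.0], [2.0, 0.0]]
text_embs = [[0.0, 1.0], [0.0, 3.0]]
gap = modality_gap(image_embs, text_embs)
```

Under this definition a smaller value means the image and text embeddings occupy more closely overlapping regions of the shared space, which is the property the UMAP plot visualizes.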