Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Pathology Analysis

Shengxuming Zhang, Weihan Li, Tianhong Gao, Jiacong Hu, Haoming Luo, Xiuming Zhang, Jing Zhang, Mingli Song, Zunlei Feng

TL;DR

OmniPath tackles the resolution and context gaps in large vision-language models for pathology by introducing two complementary strategies: mixed task-guided feature enhancement (MGFE) and prompt-guided detail feature completion (PGFC). The approach enriches multi-scale feature extraction and enables selective high-resolution detail fusion, using an enhanced architecture (including an extra pathology-focused ViT UNI and a mask decoder) and a large, multi-institution dataset of ~490K samples across 21 organs. Empirical results show OmniPath achieving state-of-the-art performance across patch-level and slide-level diagnostic tasks, zero-shot classification, and detection/segmentation, while maintaining inference efficiency. These findings suggest a clinically viable, interactive framework for comprehensive pathology analysis and auxiliary diagnosis, with future work planned around retrieval augmentation, specialized multi-agent reasoning, and reinforcement-learning–driven improvements to autonomous diagnostic capabilities.

Abstract

Pathological diagnosis is vital for determining disease characteristics, guiding treatment, and assessing prognosis, relying heavily on detailed, multi-scale analysis of high-resolution whole slide images (WSI). However, existing large vision-language models (LVLMs) are limited by input resolution constraints, hindering their efficiency and accuracy in pathology image analysis. To overcome these issues, we propose two innovative strategies: the mixed task-guided feature enhancement, which directs feature extraction toward lesion-related details across scales, and the prompt-guided detail feature completion, which integrates coarse- and fine-grained features from WSI based on specific prompts without compromising inference speed. Leveraging a comprehensive dataset of 490K samples from diverse pathology tasks, we trained the pathology-specialized LVLM, OmniPath. Extensive experiments demonstrate that this model significantly outperforms existing methods in diagnostic accuracy and efficiency, providing an interactive, clinically aligned approach for auxiliary diagnosis in a wide range of pathology applications.

Paper Structure

This paper contains 23 sections, 1 equation, 6 figures, 32 tables.

Figures (6)

  • Figure 1: Dialogue examples of our OmniPath, a vision-language model optimized for pathology, applied to referring expression detection, segmentation, and visual question answering. Notably, in the first example, OmniPath is tasked with detecting cancer cell nuclei within blood vessels. Results show that OmniPath accurately identifies most nuclei within vessels without mistakenly detecting any outside, demonstrating its capability to understand pathological concepts and reason effectively.
  • Figure 2: The green contours on the pathology slides mark cancerous regions annotated by pathologists. The first column shows the attention distribution heatmap of the LLM's final input token over all image tokens, where the intensity of attention values is mapped from blue (low) to red (high). In each row showing different model results, a red box and a yellow box are used to select a key token (with relatively high attention) and an ordinary token (with relatively low attention) respectively. The attention distributions of the selected key token and ordinary token over other image tokens are then visualized in the second and third columns respectively. All experiments were conducted using identical prompts, with attention values extracted from the first layer of the LLM.
  • Figure 3: Overview of the proposed OmniPath. Left: the architecture of OmniPath with the MGFE and PGFC strategies. The model improvements introduced by MGFE are a multi-scale feature fusion vision encoder and an additional mask decoder. In the PGFC process, shown by the red dashed line, a WSI thumbnail and its corresponding prompt are input into OmniPath; the top-$S$ patches with the highest attention values are selected, and their higher-resolution images are retrieved from the original WSI and added as supplementary input to OmniPath. Right: the detailed structure of the multi-scale feature fusion vision encoder.
  • Figure 4: More samples of the attention of the upcoming token over image tokens (as in the first column of Figure 2 in the main paper). The intensity of attention values is mapped from blue (low) to red (high), and the green contours on the pathology slides mark cancerous regions annotated by pathologists. It can be observed that the image tokens focused on by OmniPath are generally concentrated within the cancerous regions.
  • Figure 5: t-SNE visualization results of slide-level image features extracted by vision encoders of Quilt-LLaVA (first row) and OmniPath (second row), respectively. Based on the cancer regions annotated by pathologists, we classify the image feature tokens into two categories: benign and cancer. It can be seen that the image features extracted by OmniPath demonstrate superior inter-class discriminability and intra-class diversity.
  • ...and 1 more figure
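
The PGFC selection step described in the Figure 3 caption (pick the top-$S$ thumbnail patches by attention, then fetch their higher-resolution counterparts from the original WSI) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `select_top_s_patches` and `patch_to_wsi_box`, the row-major token grid, and the use of a single attention value per image token are assumptions made for illustration only. The summary above does not specify how attention is aggregated or how patches are cropped.

```python
from typing import List, Sequence, Tuple


def select_top_s_patches(
    attention: Sequence[float], grid_w: int, s: int
) -> List[Tuple[int, int]]:
    """Return (row, col) grid positions of the s image tokens with the highest attention.

    Assumes `attention` holds one value per thumbnail patch, in row-major order.
    """
    if grid_w <= 0 or len(attention) % grid_w != 0:
        raise ValueError("attention length must be a positive multiple of grid_w")
    s = max(0, min(s, len(attention)))
    # Rank by descending attention; ties go to the lower token index.
    ranked = sorted(range(len(attention)), key=lambda i: (-attention[i], i))
    return [divmod(i, grid_w) for i in ranked[:s]]


def patch_to_wsi_box(
    row: int, col: int, grid_h: int, grid_w: int, wsi_w: int, wsi_h: int
) -> Tuple[int, int, int, int]:
    """Map a thumbnail grid cell to a (left, top, right, bottom) pixel box in the WSI."""
    left = col * wsi_w // grid_w
    right = (col + 1) * wsi_w // grid_w
    top = row * wsi_h // grid_h
    bottom = (row + 1) * wsi_h // grid_h
    return left, top, right, bottom


# Hypothetical 2 x 3 grid of attention values over thumbnail patches.
attention = [0.01, 0.30, 0.05,
             0.02, 0.40, 0.22]
top_patches = select_top_s_patches(attention, grid_w=3, s=2)
boxes = [patch_to_wsi_box(r, c, 2, 3, wsi_w=3000, wsi_h=2000) for r, c in top_patches]
print(top_patches)  # [(1, 1), (0, 1)]
print(boxes)        # [(1000, 1000, 2000, 2000), (1000, 0, 2000, 1000)]
```

In this sketch, the returned pixel boxes would be the regions to crop from the full-resolution slide and feed back to the model as supplementary input; the value of $S$ bounds how much extra high-resolution content is added.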