Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Pathology Analysis

Shengxuming Zhang; Weihan Li; Tianhong Gao; Jiacong Hu; Haoming Luo; Xiuming Zhang; Jing Zhang; Mingli Song; Zunlei Feng

TL;DR

OmniPath tackles the resolution and context gaps in large vision-language models for pathology by introducing two complementary strategies: mixed task-guided feature enhancement (MGFE) and prompt-guided detail feature completion (PGFC). The approach enriches multi-scale feature extraction and enables selective fusion of high-resolution detail, using an enhanced architecture (including an additional pathology-focused ViT, UNI, and a mask decoder) and a large, multi-institution dataset of ~490K samples across 21 organs. Empirical results show OmniPath achieving state-of-the-art performance on patch-level and slide-level diagnostic tasks, zero-shot classification, and detection/segmentation, while maintaining inference efficiency. These findings suggest a clinically viable, interactive framework for comprehensive pathology analysis and auxiliary diagnosis. Planned future work covers retrieval augmentation, specialized multi-agent reasoning, and reinforcement-learning-driven improvements to autonomous diagnostic capabilities.

Abstract

Pathological diagnosis is vital for determining disease characteristics, guiding treatment, and assessing prognosis, and it relies heavily on detailed, multi-scale analysis of high-resolution whole slide images (WSIs). However, existing large vision-language models (LVLMs) are limited by input resolution constraints, which hinder their efficiency and accuracy in pathology image analysis. To overcome these issues, we propose two strategies: mixed task-guided feature enhancement, which directs feature extraction toward lesion-related details across scales, and prompt-guided detail feature completion, which integrates coarse- and fine-grained features from WSIs based on specific prompts without compromising inference speed. Leveraging a comprehensive dataset of 490K samples from diverse pathology tasks, we trained the pathology-specialized LVLM, OmniPath. Extensive experiments demonstrate that this model significantly outperforms existing methods in diagnostic accuracy and efficiency, providing an interactive, clinically aligned approach for auxiliary diagnosis in a wide range of pathology applications.
