PathVG: A New Benchmark and Dataset for Pathology Visual Grounding

Chunlin Zhong, Shuang Hao, Junhua Wu, Xiaona Chang, Jiwei Jiang, Xiu Nie, He Tang, Xiang Bai

TL;DR

This work introduces PathVG, a pathology-focused visual grounding benchmark for region-level localization conditioned on language expressions, addressing the limitations of existing nuclei segmentation and visual question answering tasks. It provides RefPath, a large-scale dataset of multi-scale pathology images with 33,500 language-grounded boxes, built from expert annotation and LLM-assisted expressions. The proposed Pathology Knowledge-enhanced Network (PKNet) fuses visual, expression, and knowledge features via a Knowledge Fusion Module and a Vision–Language fusion head to map implicit pathology terms to explicit visual cues and predict $(x,y,w,h)$ boxes. On RefPath, PKNet achieves state-of-the-art results, which supports the value of integrating domain knowledge for pathology grounding. The authors also acknowledge the need for semi-/unsupervised strategies given annotation costs.

Abstract

With the rapid development of computational pathology, many AI-assisted diagnostic tasks have emerged. Cellular nuclei segmentation can segment various types of cells for downstream analysis, but it relies on predefined categories and lacks flexibility. Moreover, pathology visual question answering can perform image-level understanding but lacks region-level detection capability. To address these limitations, we propose a new benchmark called Pathology Visual Grounding (PathVG), which aims to detect regions based on expressions with different attributes. To evaluate PathVG, we create a new dataset named RefPath, which contains 27,610 images with 33,500 language-grounded boxes. Compared to visual grounding in other domains, PathVG presents pathological images at multiple scales and contains expressions that carry pathological knowledge. In our experimental study, we find that the biggest challenge is the implicit information underlying the pathological expressions. Based on this finding, we propose the Pathology Knowledge-enhanced Network (PKNet) as the baseline model for PathVG. PKNet leverages the knowledge-enhancement capabilities of Large Language Models (LLMs) to convert pathological terms with implicit information into explicit visual features, and fuses knowledge features with expression features through the designed Knowledge Fusion Module (KFM). The proposed method achieves state-of-the-art performance on the PathVG benchmark.
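
To make the task's output concrete, the following is a minimal sketch of how a predicted box could be compared with an annotated one. It assumes boxes are stored as $(x,y,w,h)$ with $(x,y)$ the box center, and it uses intersection-over-union (IoU), a standard overlap measure in visual grounding. The text above states neither the box convention nor the evaluation metric, so both are assumptions, and the names `to_corners` and `iou` are hypothetical.

```python
from typing import Tuple

# (x, y, w, h); treating (x, y) as the box center is an assumption.
Box = Tuple[float, float, float, float]


def to_corners(box: Box) -> Box:
    """Convert a center-based (x, y, w, h) box to (x1, y1, x2, y2) corners."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def iou(pred: Box, gt: Box) -> float:
    """Intersection-over-union of two center-based (x, y, w, h) boxes."""
    px1, py1, px2, py2 = to_corners(pred)
    gx1, gy1, gx2, gy2 = to_corners(gt)
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the part counted twice.
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    gt = (50.0, 50.0, 20.0, 20.0)    # hypothetical annotated box
    pred = (55.0, 50.0, 20.0, 20.0)  # hypothetical prediction, shifted right by 5
    print(f"IoU = {iou(pred, gt):.3f}")
```

In this example the two 20×20 boxes overlap in a 15×20 region, so the IoU is 300 / 500 = 0.6. A grounding prediction is often counted as correct when its IoU with the annotated box exceeds a chosen threshold; whether PathVG does so is not stated here.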

Paper Structure

This paper contains 9 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A comparison of (a) cellular nuclei segmentation, (b) pathology visual question answering, and (c) our proposed PathVG benchmark.
  • Figure 2: (a) Previous Medical Visual Grounding. (b) Pathology Visual Grounding: Identical region at Lower (Left) and Higher (Right) Magnification, with expressions for cell arrangement and interactions with neighboring cells (Red), as well as cell structure and growth (Green).
  • Figure 3: (a) Example from RefPath, showing how the images differ at different magnification levels. (b) Word cloud of the top 100 words in the RefPath dataset, showing the pathological terms used in the expressions.
  • Figure 4: Overview of the proposed method. The model uses the knowledge enhancement ability of LLMs to connect pathological expressions with visual features, integrates Expression and Knowledge features through KFM, and outputs the final language-grounded boxes by combining visual features with CFM.
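
The Figure 4 caption names the stages of the pipeline but not their internals. As a rough illustration of what "integrating Expression and Knowledge features" could look like, the sketch below assumes a cross-attention step in which each expression token gathers information from the LLM-derived knowledge tokens, followed by a residual addition. This is one common fusion pattern, not the documented design of KFM. The functions `cross_attend` and `fuse_knowledge` are hypothetical, features are plain Python lists rather than tensors, and the learned projections a real module would have are omitted. The later combination with visual features is left out because the caption does not describe it.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per token, one column per feature


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, context: Matrix) -> Matrix:
    """Each query token receives a softmax-weighted average of context tokens."""
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score between this query and every context token.
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)])
    return out


def fuse_knowledge(expression: Matrix, knowledge: Matrix) -> Matrix:
    """Hypothetical stand-in for KFM: expression + knowledge attended by expression."""
    attended = cross_attend(expression, knowledge)
    return [
        [e + a for e, a in zip(e_row, a_row)]
        for e_row, a_row in zip(expression, attended)
    ]


if __name__ == "__main__":
    expression = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # 2 toy expression tokens
    knowledge = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    fused = fuse_knowledge(expression, knowledge)
    print(len(fused), "tokens x", len(fused[0]), "features")
```

The fused output keeps the shape of the expression features (one row per expression token), so a downstream vision-language stage could consume it in place of the raw expression features.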