PathoHR: Hierarchical Reasoning for Vision-Language Models in Pathology

Yating Huang, Ziyan Huang, Lintao Xiang, Qijun Yang, Hujun Yin

TL;DR

PathoHR addresses the gap in pathology VL models by focusing on hierarchical structure and compositional reasoning in diagnostic narratives. It introduces PathoHR-Bench to systematically probe text perturbations and semantic roles, revealing limitations of existing VL models in pathology-specific reasoning. To overcome these gaps, the authors propose a pathology-driven training scheme with four branches—textual perturbation, hierarchical reasoning-based text expansion, dual-constraint negative mining, and wavelet-morphology refinement—linked by a CLIP-like contrastive loss with additional negative and positive terms: $L = L_{con} + \alpha \cdot L_{neg} + \beta \cdot L_{pos}$. Empirical results show state-of-the-art performance on PathoHR-Bench and six pathology datasets, demonstrating improved fine-grained pathology representation and potential clinical impact, while also outlining limitations and directions for future work in integrating external medical knowledge and scaling perturbation strategies.
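The summary above gives only the overall form of the objective, $L = L_{con} + \alpha \cdot L_{neg} + \beta \cdot L_{pos}$, and does not define the individual terms. The sketch below is one plausible reading, not the authors' implementation: it assumes $L_{con}$ is a symmetric CLIP-style InfoNCE loss, $L_{neg}$ is a margin (hinge) penalty that keeps each image closer to its true report than to a perturbed one, and $L_{pos}$ is a cosine-distance term pulling each image toward its expanded report. The function names, the `margin` and `temperature` values, and the default weights are all hypothetical.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def clip_contrastive_loss(img, txt, temperature=0.07):
    # Symmetric InfoNCE: the i-th image should match the i-th text and vice versa.
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits)[idx, idx].mean()
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def total_loss(img, txt, neg_txt, pos_txt, alpha=0.5, beta=0.5, margin=0.2):
    # All inputs are (batch, dim) embedding matrices; row i of each belongs together.
    img, txt, neg_txt, pos_txt = map(l2_normalize, (img, txt, neg_txt, pos_txt))
    l_con = clip_contrastive_loss(img, txt)
    sim_true = (img * txt).sum(axis=1)
    sim_neg = (img * neg_txt).sum(axis=1)
    # ASSUMPTION: hinge penalty when a perturbed text scores within `margin` of the true text.
    l_neg = np.maximum(0.0, margin + sim_neg - sim_true).mean()
    # ASSUMPTION: cosine distance between each image and its expanded (positive) text.
    l_pos = (1.0 - (img * pos_txt).sum(axis=1)).mean()
    return l_con + alpha * l_neg + beta * l_pos, (l_con, l_neg, l_pos)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb, text_emb, perturbed_emb, expanded_emb = (
        rng.normal(size=(4, 8)) for _ in range(4)
    )
    loss, parts = total_loss(image_emb, text_emb, perturbed_emb, expanded_emb)
    print(loss, parts)
```

Returning the three components alongside the total makes it easy to monitor how much the negative and positive terms contribute relative to the base contrastive loss when tuning $\alpha$ and $\beta$.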

Abstract

Accurate analysis of pathological images is essential for automated tumor diagnosis but remains challenging due to high structural similarity and subtle morphological variations in tissue images. Current vision-language (VL) models often struggle to capture the complex reasoning required for interpreting structured pathological reports. To address these limitations, we propose PathoHR-Bench, a novel benchmark designed to evaluate VL models' abilities in hierarchical semantic understanding and compositional reasoning within the pathology domain. Results on this benchmark reveal that existing VL models fail to effectively model intricate cross-modal relationships, which limits their applicability in clinical settings. To overcome this, we further introduce a pathology-specific VL training scheme that generates enhanced and perturbed samples for multimodal contrastive learning. Experimental evaluations demonstrate that our approach achieves state-of-the-art performance on PathoHR-Bench and six additional pathology datasets, highlighting its effectiveness in fine-grained pathology representation.

Paper Structure

This paper contains 24 sections, 12 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Radar charts for compared models on PathoHR-Bench across multiple compositional reasoning aspects. The axes correspond to three types of perturbations: I (Information Loss), S (Semantic Drift), O (Order Variation), evaluated under three semantic roles: Entities, Descriptors and Connections. Higher values indicate stronger robustness.
  • Figure 2: Cross-dimensional taxonomy in PathoHR-Bench: Text perturbation levels and semantic role levels.
  • Figure 3: Overview of proposed PathoHR-Bench, comprising three sensitivity tests (top row) with performances of existing VL models (bottom left). Bottom right shows further semantic perturbation levels.
  • Figure 4: Structured textual and visual data manipulation for pathological VL training.
  • Figure A1: Case studies of semantic drift perturbations at the connections level, information loss perturbations at the descriptors level, and order variation perturbations at the entities level within PathoHR-Bench.
  • ...and 1 more figure