Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

Yuxuan Yang, Zhonghao Yan, Yi Zhang, Bo Yun, Muxi Diao, Guowei Zhao, Kongming Liang, Wenbin Li, Zhanyu Ma

TL;DR

Diagnosing hepatocellular carcinoma from gigapixel WSIs is hindered by heterogeneity and the trade-off between patch detail and global context. The authors propose Sparse Topo-Pack Attention to model 2D tissue topology, introduce the HepatoPathoVQA multi-scale dataset, and develop Hepato-LLaVA via a three-stage training pipeline with a Q-Former connector, achieving state-of-the-art results (Avg 0.83) across morphology and diagnosis tasks with robust multi-scale consistency. This topology-aware, multi-scale approach reduces redundancy while preserving critical diagnostic cues, enabling precise, clinically relevant VQA and captioning for HCC pathology. Collectively, the work advances efficient, fine-grained WSI analysis and has potential to improve real-world pathology workflows.

Abstract

Hepatocellular Carcinoma diagnosis relies heavily on the interpretation of gigapixel Whole Slide Images. However, current computational approaches are constrained by fixed-resolution processing mechanisms and inefficient feature aggregation, which inevitably lead to either severe information loss or high feature redundancy. To address these challenges, we propose Hepato-LLaVA, a specialized Multi-modal Large Language Model designed for fine-grained hepatocellular pathology analysis. We introduce a novel Sparse Topo-Pack Attention mechanism that explicitly models 2D tissue topology. This mechanism effectively aggregates local diagnostic evidence into semantic summary tokens while preserving global context. Furthermore, to overcome the lack of multi-scale data, we present HepatoPathoVQA, a clinically grounded dataset comprising 33K hierarchically structured question-answer pairs validated by expert pathologists. Our experiments demonstrate that Hepato-LLaVA achieves state-of-the-art performance on HCC diagnosis and captioning tasks, significantly outperforming existing methods. Our code and implementation details are available at https://pris-cv.github.io/Hepto-LLaVA/.

Paper Structure (11 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Overview of the HepatoPathoVQA construction pipeline: (1) Extracts ROIs and Patches from WSIs using MST-based clustering and triangular seed-point selection. (2) Employs Gemini-3-flash for hierarchical inference by integrating macroscopic descriptions as context for subsequent microscopic analysis. (3) Generates multi-scale QA pairs and captions for instruction tuning and alignment.
  • Figure 2: Overview of the Hepato-LLaVA framework: (Upper) Incorporates Sparse Topo-Pack Attention into the model architecture. (Lower) Implements a three-stage training pipeline: MAE pre-training, MoCo pre-training, and instruction tuning (via LoRA). The sparse attention mask defines three topological interactions: (1) Global Sink for macro-context broadcasting, (2) Intra-Pack for local dense interactions, and (3) Inter-Pack for summary-level connections across packs.
  • Figure 3: A representative diagnostic case from HepatoPathoVQA, illustrating Hepato-LLaVA’s capability to interpret critical morphological evidence precisely.
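
The three interactions named in the Figure 2 caption (Global Sink, Intra-Pack, Inter-Pack) can be illustrated with a small boolean attention mask. The following is a minimal sketch and not the authors' implementation. The token layout (one global token, then one summary token per pack, then patch tokens in row-major order), the square `pack` window over the 2D patch grid, the rule that the global token reads only summary tokens, and the function name `topo_pack_mask` are all assumptions made for illustration.

```python
import numpy as np


def topo_pack_mask(grid_h: int, grid_w: int, pack: int) -> np.ndarray:
    """Build a boolean attention mask (True = query row may attend to key column).

    Hypothetical token layout, assumed for this sketch only:
      index 0                   -> global token
      indices 1 .. n_packs      -> one summary token per pack
      remaining indices         -> patch tokens, row-major over the 2D grid
    """
    assert grid_h % pack == 0 and grid_w % pack == 0, "grid must tile into packs"
    packs_w = grid_w // pack
    n_packs = (grid_h // pack) * packs_w
    n_patches = grid_h * grid_w
    n = 1 + n_packs + n_patches

    # Assign each patch to the pack (2D window) that contains it.
    rows, cols = np.divmod(np.arange(n_patches), grid_w)
    patch_pack = (rows // pack) * packs_w + (cols // pack)

    # Pack id per token; -1 marks the global token, which belongs to no pack.
    pack_id = np.concatenate([[-1], np.arange(n_packs), patch_pack])
    is_summary = np.zeros(n, dtype=bool)
    is_summary[1:1 + n_packs] = True

    mask = np.zeros((n, n), dtype=bool)

    # (1) Global Sink: every token may attend to the global token.
    mask[:, 0] = True
    # Assumption: the global token itself reads only the pack summaries.
    mask[0, is_summary] = True

    # (2) Intra-Pack: dense attention among a pack's summary and patch tokens.
    same_pack = pack_id[:, None] == pack_id[None, :]
    mask |= same_pack & (pack_id[:, None] >= 0)

    # (3) Inter-Pack: summary tokens attend to one another across packs.
    mask |= is_summary[:, None] & is_summary[None, :]
    return mask


# Example: a 4x4 patch grid split into 2x2 packs gives 1 + 4 + 16 = 21 tokens.
mask = topo_pack_mask(4, 4, 2)
density = mask.mean()  # fraction of allowed query-key pairs
```

Under these assumptions, patches that are vertical neighbours inside the same window can attend to each other even though they are far apart in the row-major sequence, which is the sense in which the mask follows 2D topology rather than 1D token order. Patches in different packs exchange information only indirectly, through their summary tokens or the global token.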