HistGen: Histopathology Report Generation via Local-Global Feature Encoding and Cross-modal Context Interaction

Zhengrui Guo; Jiabo Ma; Yingxue Xu; Yihui Wang; Liansheng Wang; Hao Chen

TL;DR

The paper addresses automated histopathology report generation for gigapixel whole slide images (WSIs), aiming to reduce pathologists' workload. HistGen introduces a local-global hierarchical encoder and a cross-modal context module within a multiple instance learning (MIL) framework, and uses a pre-trained DINOv2 ViT-L feature extractor to encode WSI patches. The authors curate a WSI–report dataset of 7,753 TCGA pairs and report state-of-the-art performance on WSI report generation, with strong transfer to cancer subtyping and survival analysis after fine-tuning. The work provides a public benchmark, dataset, and code, highlighting its practical impact for AI-assisted pathology and its potential extension to other medical imaging domains.

Abstract

Histopathology serves as the gold standard in cancer diagnosis, and clinical reports are vital for interpreting this process and for guiding cancer treatment and patient care. Automating histopathology report generation with deep learning stands to significantly enhance clinical efficiency and lessen the labor-intensive, time-consuming burden of report writing on pathologists. In pursuit of this advancement, we introduce HistGen, a multiple instance learning-empowered framework for histopathology report generation, together with the first benchmark dataset for evaluation. Inspired by diagnostic and report-writing workflows, HistGen features two carefully designed modules that aim to boost report generation by aligning whole slide images (WSIs) and diagnostic reports at both local and global granularity. To achieve this, a local-global hierarchical encoder is developed for efficient visual feature aggregation from a region-to-slide perspective. Meanwhile, a cross-modal context module is proposed to explicitly facilitate alignment and interaction between distinct modalities, effectively bridging the gap between the extensive visual sequences of WSIs and the corresponding highly summarized reports. Experimental results on WSI report generation show that the proposed model outperforms state-of-the-art (SOTA) models by a large margin. Moreover, fine-tuning our model on cancer subtyping and survival analysis tasks yields superior performance compared to SOTA methods, showcasing strong transfer learning capability. Dataset, model weights, and source code are available at https://github.com/dddavid4real/HistGen.

Paper Structure (9 sections, 2 equations, 4 figures, 6 tables)

Introduction
Method
WSI-Report Dataset Curation
HistGen for Automated WSI Report Generation
Experiments
Implementation Details
WSI Report Generation Results
Transfer Learning for Cancer Diagnosis and Prognosis
Conclusion

Figures (4)

Figure 1: Overview of the proposed HistGen framework: (a) local-global hierarchical encoder module, (b) cross-modal context module, (c) decoder module, (d) transfer learning strategy for cancer diagnosis and prognosis.
Figure 2: Qualitative analysis of the proposed HistGen model. Words highlighted in bold green indicate alignment between our model’s generated results and the ground truth. Conversely, words underlined in orange represent diagnostic details that our model fails to capture. The first two examples highlight the captioning capability of our model, which accurately diagnoses the provided WSIs. The diagnoses closely align with the ground truths, differing only in minor, non-critical aspects. In the third example, our model makes the correct prediction, although it omits the detailed context present in the ground truth.
Figure 3: WSI distribution for DINOv2 ViT-L feature extractor pre-training. We have collected over 30 different pathology datasets containing over 60 primary sites. Patches are extracted from whole slide images at level $0$, with dimensions of $512\times 512$. These patches are subsequently resized to $224\times 224$ for pre-training the feature extractor. This figure shows the details of our collected WSIs.
Figure 4: Patch distribution for DINOv2 ViT-L feature extractor pre-training.
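
The patch preparation described in the Figure 3 caption (level-0 patches of 512 x 512 pixels, resized to 224 x 224) can be illustrated with a small sketch. Only the two sizes come from the caption. The non-overlapping tiling, the dropping of partial edge patches, the nearest-neighbour resize, the helper names `patch_origins` and `resize_nearest`, and the example slide dimensions are all assumptions made for illustration, not the authors' pipeline.

```python
from typing import Iterator, List, Tuple

PATCH_SIZE = 512  # level-0 patch edge in pixels, from the Figure 3 caption
MODEL_SIZE = 224  # patch edge after resizing, from the Figure 3 caption


def patch_origins(width: int, height: int,
                  patch: int = PATCH_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (x, y) of each full patch in row-major order.

    Assumption: patches do not overlap and partial edge patches are dropped.
    """
    for y in range(0, height - patch + 1, patch):
        for x in range(0, width - patch + 1, patch):
            yield (x, y)


def resize_nearest(tile: List[list], out_size: int = MODEL_SIZE) -> List[list]:
    """Nearest-neighbour resize of a square tile stored as a list of rows.

    Assumption: a stand-in for whatever interpolation the real pipeline uses.
    """
    in_size = len(tile)
    idx = [i * in_size // out_size for i in range(out_size)]
    return [[tile[r][c] for c in idx] for r in idx]


# Hypothetical 2048 x 1536 pixel slide region at level 0.
origins = list(patch_origins(2048, 1536))
print(len(origins))  # 4 columns x 3 rows of full patches

# Synthetic 512 x 512 tile whose pixel value encodes its row index.
tile = [[r] * PATCH_SIZE for r in range(PATCH_SIZE)]
small = resize_nearest(tile)
print(len(small), len(small[0]))
```

In practice the tiles would be read from the slide file with a WSI reader and resized with an image library; the pure-Python version here only makes the arithmetic explicit.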
