mTREE: Multi-Level Text-Guided Representation End-to-End Learning for Whole Slide Image Analysis

Quan Liu; Ruining Deng; Can Cui; Tianyuan Yao; Vishwesh Nath; Yucheng Tang; Yuankai Huo

TL;DR

mTREE tackles the challenge of integrating textual pathology information with gigapixel WSIs in an end-to-end framework. It introduces a dual-level, text-guided representation learning approach that performs global-to-local localization and local-to-global aggregation, guided by text features via cosine similarity and a learned attention mechanism. The method demonstrates improvements in cancer grade classification and survival prediction on TCGA-KIRC and TCGA-GBMLGG, while providing attention-based visualizations for interpretability. This work enables efficient, weakly supervised WSI analysis that leverages clinical text to reduce patch enumeration and enhance explainability in pathology AI.

Abstract

Multi-modal learning adeptly integrates visual and textual data, but its application to histopathology image and text analysis remains challenging, particularly with large, high-resolution images such as gigapixel Whole Slide Images (WSIs). Current methods typically rely on manual region labeling or multi-stage learning to assemble local representations (e.g., patch-level) into global features (e.g., slide-level). However, there is no effective way to integrate multi-scale image representations with text data in a seamless end-to-end process. In this study, we introduce Multi-Level Text-Guided Representation End-to-End Learning (mTREE). This text-guided approach captures multi-scale WSI representations by using the accompanying textual pathology information. mTREE combines the localization of key areas (global-to-local) and the development of a WSI-level image-text representation (local-to-global) into a unified, end-to-end learning framework. In this model, textual information serves a dual purpose: first, it functions as an attention map that identifies key areas, and second, it acts as a conduit for integrating textual features into the comprehensive representation of the image. Our study demonstrates the effectiveness of mTREE through quantitative analyses on two image-related tasks, classification and survival prediction, where it outperforms the baselines.
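The local-to-global step described in the abstract (score patches against the text feature, keep the best ones, and fuse them with the text) can be illustrated with a small sketch. It is only a sketch under stated assumptions, not the authors' implementation: the function names, the softmax weighting, and the concatenation with the text feature are hypothetical choices standing in for the learned attention mechanism, and the toy vectors are invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_patches_by_text(patch_features, text_feature, k):
    """Return the indices of the k patches most similar to the text feature, plus all scores."""
    scores = [cosine_similarity(p, text_feature) for p in patch_features]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


def aggregate_with_text(patch_features, text_feature, indices, scores):
    """Softmax-weighted mean of the selected patch features, concatenated with the text feature.

    Assumption: a fixed softmax over similarity scores replaces the learned attention.
    """
    max_score = max(scores[i] for i in indices)
    weights = [math.exp(scores[i] - max_score) for i in indices]
    total = sum(weights)
    dim = len(patch_features[0])
    pooled = [
        sum(w * patch_features[i][d] for w, i in zip(weights, indices)) / total
        for d in range(dim)
    ]
    return pooled + list(text_feature)


# Toy example: four 2-D patch features and one 2-D text feature (all hypothetical).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
text = [1.0, 0.0]
top, sims = select_patches_by_text(patches, text, k=2)
fused = aggregate_with_text(patches, text, top, sims)
```

In this toy case the two patches pointing most nearly in the direction of the text feature are kept, and the fused vector has the pooled image part followed by the text part. In a trained model the weights would come from learned parameters rather than a fixed softmax.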

Paper Structure (21 sections, 6 equations, 4 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Multi-instance learning
Attention sampling
Visual language model in WSI analysis
Methods
Global alignment
Local alignment
End-to-end training with image sampler
Experiments
Data description
Data preprocessing
Network architectures
Training details
Baseline experiments for comparison
...and 6 more sections

Figures (4)

Figure 1: Comparison between multi-instance learning, pathologist diagnosis, and our proposed mTREE. (a) Traditional multi-instance learning must process all patches, with no patch selection. (b) During diagnosis, pathologists focus on the most essential patches, which they select manually. (c) Our proposed mTREE generates text-guided attention to sample patches efficiently without manual annotation.
Figure 2: The proposed mTREE pipeline. The upper panel shows the text processing flow: the text encoder is frozen with pre-trained weights, and the text feature $T_0$ is used both for global alignment and for alignment with image patch features. The lower panel shows the WSI analysis flow: the attention model learns an attention map from the low-resolution WSI, and this map is aligned with the text feature $T_0$. Image patches tiled from the high-resolution WSI are ranked by attention score, and the image features $I_0, I_1, \ldots, I_k$ extracted from the patches with higher attention scores are aggregated with the text feature $T_0$.
Figure 3: The principle of multi-level text guidance. Global-level text guidance (upper panel) aligns the image attention map with the text attention: the image attention map is learned from the low-resolution WSI, while the text attention is projected from the text feature $T_0$. Local-level text guidance (lower panel) selects patches by their cosine similarity to the text feature $T_0$ and aggregates features from both image and text.
Figure 4: Visualization of WSI-level attention and the automatically derived diagnostic patches. For WSIs in the TCGA-KIRC and TCGA-GBMLGG datasets, the attention map (middle panels) is learned from the WSI (left panels) and highlights essential tissue regions. Essential image patches (right panels) are selected according to attention score. The border color of each patch indicates its attention score.
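The global-to-local step in the Figure 2 and Figure 4 captions (rank locations on a low-resolution attention map, then take the matching high-resolution patches) can be sketched as follows. This is a simplified illustration, not the paper's sampler: it uses deterministic top-k selection, and the function name, the `scale` and `patch_size` parameters, and the toy attention values are all assumptions made for the example.

```python
def top_k_patch_boxes(attention_map, k, scale, patch_size):
    """Pick the k highest-attention cells and map each to a high-resolution patch box.

    attention_map: 2-D list of scores computed on the low-resolution image.
    scale: number of high-resolution pixels covered by one attention cell (assumed square).
    patch_size: side length, in high-resolution pixels, of the patch to crop.
    Returns a list of (top, left, patch_size, score) tuples, highest score first.
    """
    cells = [
        (score, r, c)
        for r, row in enumerate(attention_map)
        for c, score in enumerate(row)
    ]
    cells.sort(key=lambda cell: cell[0], reverse=True)

    boxes = []
    for score, r, c in cells[:k]:
        # Center of the attention cell in high-resolution coordinates.
        center_y = r * scale + scale // 2
        center_x = c * scale + scale // 2
        # Clamp at the top-left image border; bottom-right clamping needs the image size.
        top = max(center_y - patch_size // 2, 0)
        left = max(center_x - patch_size // 2, 0)
        boxes.append((top, left, patch_size, score))
    return boxes


# Toy 2x2 attention map; each cell is assumed to cover 512x512 high-resolution pixels.
attention = [[0.10, 0.70], [0.15, 0.05]]
boxes = top_k_patch_boxes(attention, k=2, scale=512, patch_size=256)
```

Only the selected boxes would then be read from the high-resolution slide, which is what lets a text-guided attention map avoid enumerating every patch.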
