Rethinking Transformer for Long Contextual Histopathology Whole Slide Image Analysis

Honglin Li, Yunlong Zhang, Pingyi Chen, Zhongyi Shui, Chenglu Zhu, Lin Yang

TL;DR

This paper analyzes how the low-rank nature of the long-sequence attention matrix constrains the representation ability of WSI modelling and proposes a local-global hybrid Transformer for both computational acceleration and local-global information interactions modelling.

Abstract

Histopathology Whole Slide Image (WSI) analysis serves as the gold standard for clinical cancer diagnosis in the daily routines of doctors. To develop computer-aided diagnosis models for WSIs, previous methods typically employ Multi-Instance Learning to enable slide-level prediction given only slide-level labels. Among these models, vanilla attention mechanisms without pairwise interactions have traditionally been employed but are unable to model contextual information. More recently, self-attention models have been utilized to address this issue. To alleviate the computational complexity of long sequences in large WSIs, methods like HIPT use region-slicing, and TransMIL employs an approximation of full self-attention. Both approaches suffer from suboptimal performance due to the loss of key information. Moreover, their absolute positional embeddings struggle to handle long contextual dependencies in shape-varying WSIs. In this paper, we first analyze how the low-rank nature of the long-sequence attention matrix constrains the representation ability of WSI modelling. Then, we demonstrate that the rank of the attention matrix can be improved by focusing on local interactions via a local attention mask. Our analysis shows that the local mask aligns with the attention patterns in the lower layers of the Transformer. Furthermore, the local attention mask can be implemented during chunked attention calculation, reducing the quadratic computational complexity to linear with a small local bandwidth. Building on this, we propose a local-global hybrid Transformer for both computational acceleration and local-global information interactions modelling. Our method, Long-contextual MIL (LongMIL), is evaluated through extensive experiments on various WSI tasks to validate its superiority. Our code will be available at github.com/invoker-LL/Long-MIL.
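
To make the idea of a local attention mask over 2-d patch positions more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names `local_mask_2d` and `masked_attention`, the Euclidean distance criterion, the radius of 2.0, and the 8 x 8 toy grid are all assumptions chosen for illustration. The sketch also builds the full n x n matrix, so it shows only what the mask does, not the chunked computation that gives linear cost.

```python
import numpy as np

def local_mask_2d(coords, radius):
    """Boolean (n, n) mask: True where two patches lie within `radius` grid steps."""
    diff = coords[:, None, :] - coords[None, :, :]   # (n, n, 2) pairwise offsets
    dist = np.sqrt((diff ** 2).sum(axis=-1))          # 2-d Euclidean distance
    return dist <= radius

def masked_attention(q, k, v, mask):
    """Softmax attention restricted to the pairs allowed by `mask`."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)          # forbid non-local pairs
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)                          # exp(-inf) == 0 for masked pairs
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
# Hypothetical foreground patches on an 8 x 8 grid, stored as (row, col) indices.
grid = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
coords = np.stack(grid, axis=-1).reshape(-1, 2)
n, d = coords.shape[0], 16                            # 64 patches, 16-dim features (assumed)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

mask = local_mask_2d(coords, radius=2.0)
out, weights = masked_attention(q, k, v, mask)
```

Every patch is within distance 0 of itself, so each row of the mask keeps at least one entry and the softmax is always well defined. Because the mask depends on relative 2-d distance rather than on absolute indices, the same rule applies to slides of any shape.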

Paper Structure

This paper contains 30 sections, 14 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Handling the extremely long patch sequence of a WSI at $20\times$ magnification (or four times as many patches at $40\times$) poses a significant challenge. The $O(n^2)$ computational complexity of Transformers becomes prohibitive in such cases.
  • Figure 2: Rank and sparsity of attention matrix in WSI analysis.
  • Figure 3: LongMIL framework for WSI local-global spatial contextual information interaction and fusion. 1) Preparing patch feature embeddings and 2-d positions of WSIs. 2) Performing pairwise computations among all positions within a WSI, with local masking for acceleration. 3) Overall local-global forward pass of the model, where position information needs to be fed to both the local (local masking) and global (positional embedding) parts.
  • Figure 4: upper left: The WSI foreground is irregular (inside the green line). upper right and lower left: The 2-d position indices of WSI foreground patches are mostly scattered within index $<$100, so the area enclosed by the dashed line suffers from under-fitting with previous methods. lower right: TransMIL and full self-attention (FSA) perform relatively poorly when tested on unseen, larger WSIs. With our method, this case shows a significant performance improvement (p-value near 0.1).
  • Figure 5: Differences and similarities between various methods. upper left: HIPT slicing, with an extremely hard pattern, upper right: our proposed local mask, lower left: 2-d ALiBi, or 2-d Euclidean distance, lower right: attention mask of Prov-GigaPath from their paper (it appears causal, focusing only on the lower triangular matrix, which may be a drawing issue). The local mask of Prov-GigaPath apparently focuses mainly on 1-d interactions (weighing the x-axis of the WSI more than the y-axis), e.g., interactions at distances below 2.0 are almost entirely missed, as depicted in the red text areas of the lower-left 2-d Euclidean distance subfigure. We have checked their code implementation, which directly applies 1-d LongNet to the serialized (via z-scan) patch sequence.
  • ...and 2 more figures
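
The scaling issue described in the Figure 1 caption can be illustrated with a short back-of-the-envelope calculation. The patch count of 10,000 at $20\times$ and the local window of 25 neighbours are hypothetical values chosen only for this example, and the helper names are not from the paper. The local count is an upper bound on the pairs scored when each patch attends to at most `window` neighbours.

```python
def full_attention_pairs(n):
    """Number of pairwise scores in full self-attention: n^2."""
    return n * n

def local_attention_pairs(n, window):
    """Upper bound on scored pairs when each patch sees at most `window` neighbours."""
    return n * min(window, n)

n_20x = 10_000          # hypothetical number of patches at 20x
n_40x = 4 * n_20x       # doubling magnification quadruples the patch count
window = 25             # hypothetical local bandwidth

full_growth = full_attention_pairs(n_40x) // full_attention_pairs(n_20x)
local_growth = local_attention_pairs(n_40x, window) // local_attention_pairs(n_20x, window)
print(full_growth, local_growth)  # 16 4
```

Quadrupling the sequence length multiplies the cost of full attention by 16, while the locally masked cost grows only by the same factor of 4 as the sequence itself.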