Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning

Zhe Xu, Cheng Jin, Yihui Wang, Ziyi Liu, Hao Chen

TL;DR

This paper tackles two core problems in computational pathology: limited diagnostic reasoning in multimodal models and the heavy computation required for processing high-resolution whole-slide images. It introduces a bilateral reinforcement learning framework with a task performer that learns pathology rationales through RL and a token allocator that adaptively assigns a token budget to each image based on its visual content and the task prompt, balancing accuracy and efficiency. Across VQA, cancer subtyping, and lesion detection on six datasets, the approach achieves an average +41.7-point absolute performance gain while cutting inference costs by 70.3% relative to the base models. The work offers interpretable diagnostic reasoning and scalable computation, suggesting potential for clinical deployment and for future extension to larger models and whole-slide analysis.

Abstract

Multimodal pathological image understanding has garnered widespread interest due to its potential to improve diagnostic accuracy and enable personalized treatment through integrated visual and textual data. However, existing methods exhibit limited reasoning capabilities, which hamper their ability to handle complex diagnostic scenarios. Additionally, the enormous size of pathological images leads to severe computational burdens, further restricting their practical deployment. To address these limitations, we introduce a novel bilateral reinforcement learning framework comprising two synergistic branches. One reinforcement branch enhances the reasoning capability by enabling the model to learn task-specific decision processes, i.e., pathology rationales, directly from labels without explicit reasoning supervision. The other branch dynamically allocates a tailored number of tokens to different images based on both their visual content and task context, thereby optimizing computational efficiency. We apply our method to various pathological tasks such as visual question answering, cancer subtyping, and lesion detection. Extensive experiments show an average +41.7 absolute performance improvement with 70.3% lower inference costs over the base models, achieving both reasoning accuracy and computational efficiency.
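
To make the accuracy-versus-cost trade-off in the abstract more concrete, the following is a minimal sketch of how a label-only reward and a token-cost-aware reward could be written. It is not the authors' formulation: the function names `label_reward` and `allocator_reward`, the linear cost penalty, and the `cost_weight` parameter are all assumptions introduced here for illustration.

```python
def label_reward(predicted: str, label: str) -> float:
    """Hypothetical label-only reward: 1.0 on an exact (case-insensitive) match, else 0.0.

    No reasoning trace is scored, only the final answer against the label.
    """
    return 1.0 if predicted.strip().lower() == label.strip().lower() else 0.0


def allocator_reward(correct: float, tokens_used: int, max_tokens: int,
                     cost_weight: float = 0.5) -> float:
    """Hypothetical allocator reward: correctness minus a linear token-cost penalty.

    The linear penalty and cost_weight are illustrative assumptions,
    not taken from the paper.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= tokens_used <= max_tokens:
        raise ValueError("tokens_used must lie in [0, max_tokens]")
    return correct - cost_weight * tokens_used / max_tokens
```

Under this sketch, two correct answers are ranked by token cost, so a correct answer produced with fewer image tokens receives the higher reward, while an incorrect answer is never preferred over a correct one as long as `cost_weight` stays below 1.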

Paper Structure

This paper contains 22 sections, 10 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Overview of our framework’s ability to discover underlying pathology rationale and optimize token allocation, enabling efficient and accurate multimodal reasoning for tasks such as VQA, cancer subtyping, and lesion detection.
  • Figure 2: Overall framework of our method. High-resolution pathological images are first preprocessed into tiled patches and, together with task prompts, encoded into tokens. A task performer processes these tokens under SFT and GRPO supervision, while a token allocator dynamically adjusts the token budget via reinforcement learning.
  • Figure 3: Qualitative results of pathological reasoning.
  • Figure 4: Static token budgets result in suboptimal efficiency-accuracy trade-offs across diverse diagnostic scenarios.
  • Figure 5: Diagnostic divergence in interpreting pigmented hair follicle structures. Qwen2.5-VL (2142 image tokens) and SFT misclassify the dark area as "pigment deposit", whereas the task performer (256 tokens) identifies the correct "hair shaft" by referencing keratin/melanin properties under H&E staining. The token allocator’s reduction to 128 tokens highlights computational efficiency without compromising diagnostic fidelity.
  • ...and 2 more figures
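
The Figure 2 caption mentions GRPO supervision for the task performer. As a rough illustration of the core idea behind GRPO, the sketch below standardizes the rewards of a group of sampled responses to the same prompt, so that each response is scored relative to its siblings rather than by a learned value function. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions made here; the paper's exact training setup is not shown in this section.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are illustrative choices; implementations vary.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to one question, two of them correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In this example the correct answers receive positive advantages and the incorrect ones negative advantages of equal magnitude. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.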