Uni-Hema: Unified Model for Digital Hematopathology

Abdul Rehman, Iqra Rasool, Ayisha Imran, Mohsen Ali, Waqas Sultani

TL;DR

Uni-Hema tackles the absence of a single model capable of multi-task, multi-modal analysis in digital hematopathology. It integrates detection, classification, segmentation, morphology prediction, and visual–textual reasoning within a unified architecture built around Hema-Former, and is trained on 46 public datasets (≈700K images, ≈21K QA pairs). The model achieves results competitive with or superior to task-specific SOTA across multiple tasks and provides interpretable single-cell morphology insights, along with VQA and MLM capabilities on hematology data. This unified approach enables scalable, cross-disease digital hematopathology with potential clinical impact, and the authors plan to release the code publicly.

Abstract

Digital hematopathology requires cell-level analysis across diverse disease categories, including malignant disorders (e.g., leukemia), infectious conditions (e.g., malaria), and non-malignant red blood cell disorders (e.g., sickle cell disease). Existing approaches, whether single-task, vision-language, WSI-optimized, or single-cell hematology models, share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome this limitation, we propose Uni-Hema, a unified multi-task model for digital hematopathology that integrates detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. Uni-Hema leverages 46 publicly available datasets, encompassing over 700K images and 21K question-answer pairs, and is built upon Hema-Former, a multimodal module that bridges visual and textual representations at the hierarchy level for the different tasks (detection, classification, segmentation, morphology prediction, masked language modeling, and visual question answering) at different granularities. Extensive experiments demonstrate that Uni-Hema achieves performance comparable or superior to models trained on a single task and a single dataset across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level. Our framework establishes a new standard for multi-task and multi-modal digital hematopathology. The code will be made publicly available.

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Uni-Hema: A unified architecture supporting diverse digital hematopathology tasks, including cell detection, morphology prediction, and cell segmentation in both single-cell and complete field-of-view (FoV) images, as well as visual question answering and masked language modeling.
  • Figure 2: The Uni-Hema model architecture comprises six principal modules: ($\mathbf{B}$) an image backbone for extracting spatial features; ($\mathbf{E}_\mathbf{I}$) an image encoder for hierarchical visual embeddings; ($\mathbf{D}_\mathbf{I}$) an image decoder for cell detection and morphology prediction; ($\mathbf{E}_\mathbf{T}$) a text encoder for extracting textual features with respect to the task; ($\mathbf{D}_\mathbf{T}$) a text decoder for answer and sentence generation; and ($\mathcal{H}$) the Hema-Former module, which serves as the core of the model by bridging visual and textual representations through four submodules (see Figure 3).
  • Figure 3: Hema-Former sub-modules: (a) cross-modal fusion, which aligns text feature queries ($\mathcal{E}_\mathbf{E}^\mathbf{T}$) and image feature queries ($\mathcal{E}_\mathbf{E}^\mathbf{I}$) while incorporating learnable queries ($\mathbf{Q}_1$); (b) text-guided refinement, which updates the top-K object queries through cross-attention with text queries; (c) a single-cell feature extractor, which aggregates mean-pooled encoder features ($\mathcal{E}_\mathbf{E}^\mathbf{I}$) with a learnable query ($\mathbf{Q}_2$) for single-cell classification; and (d) a query-guided mask former (QMF) module, which uses the fused backbone features ($\mathcal{F}_1$) and spatially projected encoder features ($\mathcal{E}_\mathbf{E}^\mathbf{I}$) to predict segmentation masks ($\mathbf{Y}$).
  • Figure 4: (a) Segmentation results of TransNetR and our method on anemia, malaria, and WBC images. TP, FP, and FN are shown in light yellow, blue, and red. Our method reduces false detections and improves localization, especially for anemia and WBC, while handling malaria robustly. (b) VQA outputs show contextually accurate predictions aligned with ground truth. (c) t-SNE plots of DinoBloom-S and Uni-Hema features on the Acevedo (8-class) and Raabin (5-class) datasets display clearer class separation with Uni-Hema.
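
The text-guided refinement step named in the Figure 3 caption (updating the top-K object queries through cross-attention with text queries) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `text_guided_refinement`, the use of a per-query confidence score to pick the top K, the absence of learned projection matrices, and the residual update are all assumptions made only to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_refinement(object_queries, scores, text_queries, k):
    """Hypothetical sketch: refine the k highest-scoring object queries
    by single-head cross-attention over text queries.

    object_queries: (N, d) array of object query embeddings
    scores:         (N,)   array of per-query confidence scores (assumed)
    text_queries:   (T, d) array of text query embeddings
    """
    d = object_queries.shape[1]
    # Indices of the k queries with the highest scores.
    top_idx = np.argsort(scores)[::-1][:k]
    q = object_queries[top_idx]                          # (k, d)
    # Scaled dot-product attention: object queries attend to text queries.
    attn = softmax(q @ text_queries.T / np.sqrt(d))      # (k, T)
    refined = object_queries.copy()
    # Residual update (an assumption); unselected queries stay unchanged.
    refined[top_idx] = q + attn @ text_queries
    return refined, top_idx

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
obj = rng.normal(size=(10, 8))     # 10 object queries, dimension 8
scores = rng.uniform(size=10)      # assumed confidence per query
txt = rng.normal(size=(4, 8))      # 4 text queries
refined, idx = text_guided_refinement(obj, scores, txt, k=3)
```

Only the three selected rows of `refined` differ from `obj`; a real model would add learned query, key, and value projections and typically multiple attention heads.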