MUSE: Multi-Scale Dense Self-Distillation for Nucleus Detection and Classification

Zijiang Yang, Hanqing Chao, Bokai Zhao, Yelin Yang, Yunshuo Zhang, Dongmei Fu, Junping Zhang, Le Lu, Ke Yan, Dakai Jin, Minfeng Xu, Yun Bian, Hui Jiang

TL;DR

MUSE introduces NuLo, a nucleus-based local self-distillation mechanism, to enable flexible cross-scale self-supervision for nucleus detection and classification. The approach blends a lightweight encoder-decoder ViT backbone with multi-scale patching and a large-field-of-view semi-supervised fine-tuning pipeline, leveraging unlabeled pathology data to learn discriminative nucleus representations. Empirical results on multiple benchmarks show that MUSE outperforms supervised baselines and generic pathology foundation models, demonstrating strong data efficiency and robustness across tissue types and magnifications. The work highlights the importance of task-specific pretraining and cross-scale nucleus-context learning for dense nucleus-level prediction in histopathology.

Abstract

Nucleus detection and classification (NDC) in histopathology analysis is a fundamental task that underpins a wide range of high-level pathology applications. However, existing methods heavily rely on labor-intensive nucleus-level annotations and struggle to fully exploit large-scale unlabeled data for learning discriminative nucleus representations. In this work, we propose MUSE (MUlti-scale denSE self-distillation), a novel self-supervised learning method tailored for NDC. At its core is NuLo (Nucleus-based Local self-distillation), a coordinate-guided mechanism that enables flexible local self-distillation based on predicted nucleus positions. By removing the need for strict spatial alignment between augmented views, NuLo allows critical cross-scale alignment, thus unlocking the capacity of models for fine-grained nucleus-level representation. To support MUSE, we design a simple yet effective encoder-decoder architecture and a large field-of-view semi-supervised fine-tuning strategy that together maximize the value of unlabeled pathology images. Extensive experiments on three widely used benchmarks demonstrate that MUSE effectively addresses the core challenges of histopathological NDC. The resulting models not only surpass state-of-the-art supervised baselines but also outperform generic pathology foundation models.

Paper Structure

This paper contains 24 sections, 9 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Comparison of MUSE and SOTA methods. (a) Compared with pathology pretraining methods, MUSE achieves better nucleus classification performance with smaller backbones and only 0.5 million samples. (b) After fine-tuning, MUSE outperforms SOTA methods on nucleus detection and classification. Lym., Tum., Oth., and Avg. denote the F1 scores for lymphocytes, tumor nuclei, other nuclei, and their average, respectively.
  • Figure 2: Motivation of NuLo. Pathologists can flexibly associate nuclei with tissue. This inspires us to introduce two cross-scale self-distillation processes on matched nuclei: (a) inferring tissue information with nuclei detail and (b) inferring nuclei detail with tissue information.
  • Figure 3: Illustration of the architecture. Given a patch, this architecture produces both image-level representation (CLS token) and high-resolution dense representations for multi-scale self-distillation. $R^{(2)}$ denotes the reassembly layer of the second encoder block.
  • Figure 4: Illustration of MUSE. MPP-based cropping is first employed to generate paired views from the ROI patch at a randomly chosen MPP (microns per pixel). After data augmentation, we extract image-level representations (CLS tokens, $f_{cls}$) and dense representations (feature maps, $f_{map}$) of the paired views with teacher and student networks. MUSE minimizes two losses: 1) image-level self-distillation between CLS tokens and 2) nucleus-level self-distillation between features of matched nuclei. Specifically, nucleus features are interpolated from the feature maps at the nucleus coordinates.
  • Figure 5: Illustration of the fine-tuning. Annotated samples are first extended to new samples with a larger field of view. The pretrained backbone is then used to extract feature maps, which are further utilized to obtain features at proposal coordinates. $Ign.$ denotes that the coordinate regression is not applied to unlabeled regions.
  • ...and 2 more figures
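
The Figure 4 caption states that nucleus features are interpolated from the feature maps based on their coordinates. A minimal sketch of that step, assuming bilinear interpolation and a uniform `stride` between image pixels and feature-map cells (both are assumptions; the listing above specifies neither the interpolation kernel nor the coordinate convention), could look like the following. The names `sample_feature` and `sample_nuclei` are hypothetical and not taken from the paper.

```python
from math import floor
from typing import List, Sequence, Tuple

# Feature map indexed as [row][col][channel]; a plain-list stand-in for f_map.
FeatureMap = List[List[List[float]]]


def sample_feature(fmap: FeatureMap, x: float, y: float, stride: float = 1.0) -> List[float]:
    """Bilinearly interpolate one feature vector at image coordinate (x, y).

    Assumption: image coordinates map to feature-map coordinates by dividing
    by `stride`; out-of-range points are clamped to the grid border.
    """
    height, width, channels = len(fmap), len(fmap[0]), len(fmap[0][0])

    # Map to feature-map coordinates and clamp inside the grid.
    fx = min(max(x / stride, 0.0), width - 1)
    fy = min(max(y / stride, 0.0), height - 1)

    # Indices of the four surrounding cells.
    x0, y0 = int(floor(fx)), int(floor(fy))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)

    # Fractional offsets act as interpolation weights.
    wx, wy = fx - x0, fy - y0

    out = []
    for k in range(channels):
        top = (1 - wx) * fmap[y0][x0][k] + wx * fmap[y0][x1][k]
        bottom = (1 - wx) * fmap[y1][x0][k] + wx * fmap[y1][x1][k]
        out.append((1 - wy) * top + wy * bottom)
    return out


def sample_nuclei(
    fmap: FeatureMap, coords: Sequence[Tuple[float, float]], stride: float = 1.0
) -> List[List[float]]:
    """Return one interpolated feature vector per nucleus coordinate (x, y)."""
    return [sample_feature(fmap, x, y, stride) for x, y in coords]


# Toy 2x2 feature map with a single channel.
toy_map = [[[0.0], [1.0]], [[2.0], [3.0]]]
nucleus_features = sample_nuclei(toy_map, [(0.5, 0.5), (1.0, 0.0)])
```

Because sampling is driven by coordinates rather than by a fixed grid, the same nucleus can be looked up in two views that differ in scale or crop, as long as its coordinates are expressed in each view's own frame. This is consistent with the abstract's point that strict spatial alignment between augmented views is not required.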