NuNext: Reframing Nucleus Detection as Next-Point Detection

Zhongyi Shui, Honglin Li, Xiaozhong Ji, Ye Zhang, Zijiang Yang, Chenglu Zhu, Yuxuan Sun, Kai Yao, Conghui He, Cheng Tan

TL;DR

This work reformulates nucleus detection as next-point prediction, wherein a multimodal large language model is developed to directly output foreground nucleus centroids from the input image.

Abstract

Nucleus detection in histopathology is pivotal for a wide range of clinical applications. Existing approaches either regress nuclear proxy maps that require complex post-processing, or employ dense anchors or queries that introduce severe foreground-background imbalance. In this work, we reformulate nucleus detection as next-point prediction, wherein a multimodal large language model is developed to directly output foreground nucleus centroids from the input image. The model is trained in two stages. In the supervised learning stage, we propose spatial-aware soft supervision to relax strict centroid matching and a chain-of-visual-thought strategy to incorporate visual priors that facilitate coordinate prediction. In the reinforcement fine-tuning stage, we design a distribution matching reward, low-variance group filtering, and fine-grained advantage shaping to further improve the model's detection quality. Extensive experiments on nine widely used benchmarks demonstrate the superiority of our method. Code will be released soon.

Paper Structure

This paper contains 15 sections, 19 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Pipeline comparison between nucleus detection paradigms. (a) Density map-based methods require complicated post-processing, which is hyper-parameter sensitive and vulnerable to noise. (b) Anchor-based and (c) query-based methods suffer from severe foreground-background imbalance, as the majority of anchors/queries are assigned to background. The histogram shows the foreground proportion across the large-scale PanNuke dataset, which is below 4.5% for over 90% of images. (d) Our proposed NuNext circumvents these issues by directly predicting nuclei coordinates.
  • Figure 2: The training pipeline of NuNext. (Left) In the SFT stage, the model is trained to generate nucleus coordinate tokens with chain-of-visual-thought (CoVT) to incorporate visual cues of nuclei regions, and spatial-aware soft supervision that credits spatially proximate predictions. (Right) In the RFT stage, multiple rollouts are sampled per input, verified for format correctness, and scored with distribution matching and task-guided rewards. The model is then optimized via GRPO with fine-grained advantage shaping (FGAS).
  • Figure 3: (Left) Motivation for low-variance group filtering: GRPO standardization can inflate advantages when within-group reward difference is negligible. (Right) Illustration of FGAS. Predicted and ground-truth nuclei are first matched via the Hungarian algorithm, then validated against a distance threshold to determine true/false positives. The resulting token-level labels are used to shape the sequence-level advantage, reducing the advantage for false positive tokens in rollouts with $A>0$ and alleviating the penalty for true positive tokens when $A<0$.
  • Figure 4: Qualitative comparison with SOTA methods.
  • Figure 5: Similarity heatmap between latent tokens and visual features. High-response regions align well with nuclei areas.
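
The mechanisms named in the Figure 3 caption can be illustrated with a small sketch. It is not the authors' implementation, and every function name, parameter, and default value below is hypothetical. Three simplifications are assumed: an exhaustive search over one-to-one assignments stands in for the Hungarian algorithm (same optimum, but only practical for a handful of points), labels are kept per predicted point rather than per token, and the shaping rule is a single scaling factor `alpha`. The filtering threshold `min_std` and the use of the population standard deviation are likewise assumptions.

```python
from itertools import permutations
from math import dist
from statistics import mean, pstdev


def label_predictions(preds, gts, max_dist):
    """Return one flag per predicted point: True = true positive."""
    n_pred, n_gt = len(preds), len(gts)
    size = max(n_pred, n_gt)  # pad to a square problem with zero-cost dummies

    def pair_cost(i, j):
        return dist(preds[i], gts[j]) if i < n_pred and j < n_gt else 0.0

    # Exhaustive search for the minimum-cost one-to-one assignment.
    # Stand-in for the Hungarian algorithm; only viable for tiny inputs.
    best = min(
        permutations(range(size)),
        key=lambda perm: sum(pair_cost(i, j) for i, j in enumerate(perm)),
    )
    labels = [False] * n_pred
    for i, j in enumerate(best):
        # A matched pair counts as a true positive only within max_dist.
        if i < n_pred and j < n_gt and dist(preds[i], gts[j]) <= max_dist:
            labels[i] = True
    return labels


def group_advantages(rewards, eps=1e-8):
    """GRPO-style standardization of rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards, min_std=0.01):
    """Hypothetical filter: drop groups whose reward spread is negligible."""
    return pstdev(rewards) >= min_std


def shape_advantage(advantage, labels, alpha=0.5):
    """Assumed shaping rule: scale by alpha for false positives when A > 0
    and for true positives when A < 0; leave other points unchanged."""
    shaped = []
    for is_tp in labels:
        damp = (advantage > 0 and not is_tp) or (advantage < 0 and is_tp)
        shaped.append(advantage * alpha if damp else advantage)
    return shaped


# Toy data (made up for illustration): three predictions, two ground truths.
preds = [(10, 10), (50, 52), (90, 90)]
gts = [(11, 9), (50, 50)]
labels = label_predictions(preds, gts, max_dist=5)  # [True, True, False]

# Near-identical rewards still yield advantages of magnitude above 1.
near_equal = [0.500, 0.501, 0.499, 0.500]
inflated = group_advantages(near_equal)
```

In the toy group, rewards that differ by only 0.001 are standardized to advantages of roughly ±1.41, which is the inflation effect the caption describes; `keep_group` returns `False` for that group under the assumed threshold. `shape_advantage(1.0, labels)` leaves the two matched points at 1.0 and halves the unmatched one to 0.5.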