Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling

Jianan Fan, Dongnan Liu, Hang Chang, Heng Huang, Mei Chen, Weidong Cai

TL;DR

This work tackles automated discovery of novel biomedical concepts under non-i.i.d. and long-tailed data by introducing geometry-constrained probabilistic modeling on a hyperspherical embedding. It jointly models instance posteriors with a marginal von Mises-Fisher distribution $q(\bm z|\bm x)\sim\text{vMF}(\tilde{\boldsymbol{\mu}}_x,\tilde{\boldsymbol{\kappa}}_x)$, enforces inductive biases of uniformity and boundness via predesigned proxies on $\mathbb{S}^{d-1}$, and disciplines open-space risk through dispersion and structuring losses, while a spectral graph method estimates the number of novel classes. Theoretical analysis links distributional concentration to semantic ambiguity and provides open-space risk bounds, and extensive experiments across pneumonia, cell nuclei, skin lesions, and diabetic retinopathy demonstrate state-of-the-art performance in generalized novel class discovery for biomedical imaging. This framework enables robust open-world discovery in biomedicine despite distribution shifts, offering taxonomy-adaptive class-count estimation and scalable clustering in the hyperspherical latent space.
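
For reference, the textbook von Mises-Fisher density on $\mathbb{S}^{d-1}$ with a scalar concentration $\kappa$ is given below. This is the standard form and may differ from the marginal variant used in the paper; the normalizer $C_d$ and the modified Bessel function of the first kind $I_\nu$ are notation assumed here, not taken from the summary.

$$
f(\bm z;\boldsymbol{\mu},\kappa) = C_d(\kappa)\,\exp\!\left(\kappa\,\boldsymbol{\mu}^{\top}\bm z\right),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\,I_{d/2-1}(\kappa)},
\qquad
\|\boldsymbol{\mu}\|_2 = 1,\ \kappa > 0 .
$$

A larger $\kappa$ concentrates probability mass around the mean direction $\boldsymbol{\mu}$, while $\kappa \to 0$ approaches the uniform distribution on the sphere.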

Abstract

Machine learning holds tremendous promise for transforming the fundamental practice of scientific discovery by virtue of its data-driven nature. With the ever-increasing stream of research data collection, it would be appealing to autonomously explore patterns and insights from observational data for discovering novel classes of phenotypes and concepts. However, in the biomedical domain, several challenges inherent in the accumulated data hamper the progress of novel class discovery. The non-i.i.d. data distribution, accompanied by severe imbalance among different groups of classes, leads to ambiguous and biased semantic representations. In this work, we present a geometry-constrained probabilistic modeling treatment to resolve the identified issues. First, we propose to parameterize the approximated posterior of instance embedding as a marginal von Mises-Fisher distribution to account for the interference of distributional latent bias. Then, we incorporate a suite of critical geometric properties to impose proper constraints on the layout of the constructed embedding space, which in turn minimizes the uncontrollable risk for unknown class learning and structuring. Furthermore, a spectral graph-theoretic method is devised to estimate the number of potential novel classes. It has two merits compared to existing approaches, namely high computational efficiency and flexibility for taxonomy-adaptive estimation. Extensive experiments across various biomedical scenarios substantiate the effectiveness and general applicability of our method.
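
The abstract does not spell out how the spectral graph-theoretic estimator works. To illustrate the general idea of reading a class count off a graph spectrum, the sketch below uses the well-known eigengap heuristic on a normalized graph Laplacian. It is a stand-in technique, not the paper's method, and the function name `estimate_num_clusters`, the `temperature` parameter, and the exponential cosine affinity are all assumptions made for this example.

```python
import numpy as np


def estimate_num_clusters(embeddings, max_k=10, temperature=0.1):
    """Eigengap heuristic (illustrative stand-in, not the paper's estimator)."""
    # Project embeddings onto the unit hypersphere.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Affinity from cosine similarity; `temperature` is an assumed bandwidth.
    affinity = np.exp((z @ z.T - 1.0) / temperature)
    np.fill_diagonal(affinity, 0.0)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(z)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # The smallest eigenvalues are near zero, one per well-separated cluster.
    eigvals = np.linalg.eigvalsh(laplacian)[: max_k + 1]
    # The largest jump between consecutive eigenvalues marks the cluster count.
    return int(np.argmax(np.diff(eigvals))) + 1


if __name__ == "__main__":
    # Toy data: three noisy groups around orthogonal directions in R^3.
    rng = np.random.default_rng(0)
    points = np.vstack([c + 0.05 * rng.standard_normal((20, 3)) for c in np.eye(3)])
    print(estimate_num_clusters(points))
```

The heuristic needs only one eigendecomposition of an n-by-n matrix, so it avoids rerunning a clustering algorithm for every candidate class count.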

Paper Structure

This paper contains 11 sections, 1 theorem, 12 equations, 4 figures, 3 tables.

Key Result

Proposition 1

Let $\zeta_{\bm{x}}$ be the continuous entropy of the posterior vMF distribution parametrized by $\bm{\tilde{\mu}_x}\in\mathbb{S}^{d-1}$ and $\bm{\tilde{\kappa}_x}\in\mathbb{R}^{d}_{>0}$. We have $\zeta_{\bm{x}}(\bm{\tilde{\kappa}_x})$ behave as a monotonically decreasing function in the interval $(
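
The statement above is cut off in this summary, so the interval it refers to is not shown. As a rough numerical illustration of the monotonic trend only, the sketch below evaluates the entropy of a standard vMF distribution on $\mathbb{S}^{2}$ (that is, $d=3$) with a scalar concentration $\kappa$, where the closed form $\log(4\pi\sinh\kappa/\kappa) - \kappa\coth\kappa + 1$ holds. The scalar $\kappa$, the choice $d=3$, and the function name `vmf_entropy_s2` are assumptions for this example and are not taken from the paper.

```python
import math


def vmf_entropy_s2(kappa):
    """Differential entropy of a vMF distribution on S^2 (d = 3), scalar kappa > 0."""
    # log(sinh(kappa)) computed stably: kappa + log(1 - exp(-2*kappa)) - log(2).
    log_sinh = kappa + math.log(-math.expm1(-2.0 * kappa)) - math.log(2.0)
    # H = log(4*pi*sinh(kappa)/kappa) - kappa*coth(kappa) + 1.
    return (math.log(4.0 * math.pi) + log_sinh - math.log(kappa)
            - kappa / math.tanh(kappa) + 1.0)


if __name__ == "__main__":
    # Entropy should shrink as the distribution concentrates around its mean.
    for kappa in (0.01, 0.1, 1.0, 10.0, 100.0):
        print(f"kappa={kappa:>6}: entropy={vmf_entropy_s2(kappa):.4f}")
```

As $\kappa \to 0$ the value approaches $\log(4\pi)$, the entropy of the uniform distribution on the unit sphere. This matches the reading that low concentration corresponds to high uncertainty.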

Figures (4)

  • Figure 1: Conceptual illustration of the main insight. In the biomedical domain, violation of the i.i.d. assumption incurred by inconsistent imaging protocols across cohorts and non-uniform class distributions due to scarcity of rare classes could deteriorate the generalizability of learned representations for novel class discovery. We propose to address those issues via probabilistic modeling on a hyperspherical manifold and incorporation of geometrical inductive biases for countering semantic ambiguity and open space risk.
  • Figure 2: Overview of the proposed method. We propose to incorporate the uniformity and boundness geometrical inductive biases by establishing preorganized proxies as anchors and then structuring the geometric layout of learned embedding space successively with hyperspherical probabilistic modeling.
  • Figure 3: Results of class number estimation in the unlabeled set. For pneumonia and cell nuclei, we further estimate the number of fine-grained subclasses.
  • Figure 4: Hyperspherical and distributional embeddings of the skin lesion test set for two variants of our proposed method. For distributional visualization, we overlay data points over the super-level-set ellipses of the associated probabilistic distributions.

Theorems & Definitions (1)

  • Proposition 1