Active Learning for Planet Habitability Classification under Extreme Class Imbalance

R. I. El-Kholy, Z. M. Hayman

TL;DR

This study explores pool-based active learning as a way to improve the efficiency of habitability classification under realistic observational constraints. Its results indicate that active learning provides a principled framework for guiding habitability studies in data regimes characterized by label imbalance, incomplete information, and limited observational resources.

Abstract

The increasing size and heterogeneity of exoplanet catalogs have made systematic habitability assessment challenging, particularly given the extreme scarcity of potentially habitable planets and the evolving nature of their labels. In this study, we explore the use of pool-based active learning to improve the efficiency of habitability classification under realistic observational constraints. We construct a unified dataset from the Habitable Worlds Catalog and the NASA Exoplanet Archive and formulate habitability assessment as a binary classification problem. A supervised baseline based on gradient-boosted decision trees is established and optimized for recall in order to prioritize the identification of rare potentially habitable planets. This model is then embedded within an active learning framework, where uncertainty-based margin sampling is compared against random querying across multiple runs and labeling budgets. We find that active learning substantially reduces the number of labeled instances required to approach supervised performance, demonstrating clear gains in label efficiency. To connect these results to a practical astronomical use case, we aggregate predictions from independently trained active-learning models into an ensemble and use the resulting mean probabilities and uncertainties to rank planets originally labeled as non-habitable. This procedure identifies a single robust candidate for further study, illustrating how active learning can support conservative, uncertainty-aware prioritization of follow-up targets rather than speculative reclassification. Our results indicate that active learning provides a principled framework for guiding habitability studies in data regimes characterized by label imbalance, incomplete information, and limited observational resources.
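
The two mechanisms named in the abstract, margin sampling and ensemble-based ranking, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy probabilities, and the choice to rank by descending mean probability while reporting the standard deviation alongside are all assumptions made for illustration. For a binary problem the margin reduces to |p - (1 - p)|, so the most uncertain pool items are those with predicted probability closest to 0.5.

```python
import statistics


def margin(p_pos):
    """Binary margin: gap between the two class probabilities, |p - (1 - p)|."""
    return abs(2.0 * p_pos - 1.0)


def margin_sampling(pool_probs, batch_size):
    """Return indices of the `batch_size` pool items with the smallest margin.

    `pool_probs[i]` is the model's predicted probability that unlabeled
    item i belongs to the positive (potentially habitable) class.
    """
    order = sorted(range(len(pool_probs)), key=lambda i: margin(pool_probs[i]))
    return order[:batch_size]


def ensemble_rank(prob_matrix):
    """Aggregate per-model probabilities into (index, mean, std) tuples.

    `prob_matrix[m][i]` is the probability assigned by model m to item i.
    Sorting by descending mean is an assumption of this sketch; the
    standard deviation is kept as a simple uncertainty estimate.
    """
    n_items = len(prob_matrix[0])
    stats = []
    for i in range(n_items):
        column = [row[i] for row in prob_matrix]
        stats.append((i, statistics.fmean(column), statistics.pstdev(column)))
    return sorted(stats, key=lambda t: t[1], reverse=True)


# Hypothetical predicted probabilities for five unlabeled pool items.
pool_probs = [0.02, 0.48, 0.91, 0.55, 0.10]
queried = margin_sampling(pool_probs, batch_size=2)

# Hypothetical outputs of three independently trained models on three items.
ensemble = [
    [0.10, 0.80, 0.30],
    [0.20, 0.90, 0.10],
    [0.15, 0.85, 0.50],
]
ranking = ensemble_rank(ensemble)
```

In this toy example, `queried` holds the two items whose probabilities lie nearest 0.5, and `ranking` places the item with the highest mean probability first. A conservative selection rule could additionally require a low standard deviation before treating an item as a follow-up candidate.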

Paper Structure (16 sections, 7 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Distribution of habitability labels after cross-matching the Habitable Worlds Catalog with confirmed exoplanets from the NASA Exoplanet Archive. The dataset is highly imbalanced, with potentially habitable planets representing a small fraction of the confirmed population.
  • Figure 2: Lower-triangular Pearson correlation matrix of the final feature set used in this study. Each cell shows the Pearson correlation coefficient between a pair of planetary or stellar properties, with the color scale ranging from $-1$ (strong negative correlation) to $+1$ (strong positive correlation). The diagonal elements represent self-correlations. To improve readability, only moderate to strong correlations ($|r|\ge0.5$) are annotated.
  • Figure 3: UpSet plot showing missing-value patterns across the final feature set. The horizontal bars on the left indicate the total number of missing values for each feature. Each column in the central matrix represents a specific combination of features that are simultaneously missing, marked by filled black circles connected by vertical lines. The vertical bars above the matrix show the number of data instances exhibiting each missingness pattern (intersection size). For example, the tallest bar on the far left corresponds to instances where only orbital eccentricity is missing, while all other features are present. In contrast, columns with multiple connected black circles indicate instances where several stellar and planetary properties are missing together.
  • Figure 4: Class-conditional distributions of selected planetary and stellar parameters in the final dataset. Kernel density estimates are shown for planet radius, orbital eccentricity, and the Earth Similarity Index (ESI), and on a logarithmic scale for planet mass, incident stellar flux, and equilibrium temperature. The figure illustrates the substantial overlap between the two classes across most individual parameters, as well as systematic shifts in location and spread for several features, motivating the need for multivariate classification.
  • Figure 5: Diagnostics of orbital eccentricity imputation. Panel (a) compares the observed eccentricity distribution with the distribution of imputed values; the two agree closely, and the imputation preserves the empirical shape. Panel (b) shows imputation uncertainty, measured as the standard deviation across bootstrap imputations, as a function of the imputed eccentricity; the higher uncertainty at larger eccentricities reflects data sparsity rather than model instability. Panel (c) shows the distribution of test-set MAE across multiple bootstrap imputations, demonstrating stable and consistent predictive performance.
  • ...and 4 more figures
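
The quantities that the UpSet plot in Figure 3 displays can be sketched in a few lines. The feature names and toy rows below are hypothetical and are not taken from the paper's dataset; the sketch only shows how per-feature missing counts (the horizontal bars) and intersection sizes (the vertical bars) might be computed.

```python
from collections import Counter


def missingness_patterns(rows, features):
    """Count how many rows share each set of simultaneously missing features."""
    patterns = Counter()
    for row in rows:
        missing = frozenset(f for f in features if row.get(f) is None)
        patterns[missing] += 1
    return patterns


# Hypothetical feature names and values, for illustration only.
features = ["eccentricity", "planet_mass", "stellar_temp"]
rows = [
    {"eccentricity": None, "planet_mass": 1.2, "stellar_temp": 5700.0},
    {"eccentricity": None, "planet_mass": 0.8, "stellar_temp": 4900.0},
    {"eccentricity": 0.1, "planet_mass": None, "stellar_temp": None},
    {"eccentricity": 0.0, "planet_mass": 1.0, "stellar_temp": 5800.0},
]

# Intersection sizes: one count per distinct missingness pattern.
patterns = missingness_patterns(rows, features)

# Per-feature totals: how often each feature is missing overall.
per_feature = Counter(
    f for row in rows for f in features if row.get(f) is None
)
```

Here the pattern containing only `eccentricity` plays the role of the single-circle column described in the caption, while the pattern containing both `planet_mass` and `stellar_temp` corresponds to a column with connected circles.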