Estimating Exoplanet Mass using Machine Learning on Incomplete Datasets

Florian Lalande, Elizabeth Tasker, Kenji Doya

TL;DR

This paper compares five machine learning algorithms that can use multidimensional incomplete datasets to impute missing planet masses, and finds that imputation results improve with more data even when the additional data is incomplete.

Abstract

The exoplanet archive is an incredible resource of information on the properties of discovered extrasolar planets, but statistical analysis has been limited by the number of missing values. One of the most informative bulk properties is planet mass, which is particularly challenging to measure: more than 70\% of discovered planets have no measured value. We compare the capabilities of five different machine learning algorithms that can use multidimensional incomplete datasets to impute planet mass. We compare results obtained from a subset of the archive in which all six chosen planet properties are known with results obtained when all planet discoveries are used in incomplete sets of six and eight planet properties. We find that imputation results improve with more data even when the additional data is incomplete, and that this allows a mass prediction for any planet regardless of which properties are known. Our favored algorithm is the newly developed $k$NN$\times$KDE, which can return a probability distribution for the imputed properties. The shape of this distribution can indicate the algorithm's level of confidence, and can also provide information about the underlying demographics of the exoplanet population. We demonstrate how the distributions can be interpreted with a series of examples for planets discovered with either the transit method or the radial velocity method. Finally, we test the generative capability of the $k$NN$\times$KDE to create a large synthetic population of planets based on the archive, and identify potential categories of planets from groups of properties in the multidimensional space. All code is open source.
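
To illustrate what "returning a probability distribution for the imputed properties" can look like in practice, the following is a minimal sketch and not the authors' $k$NN$\times$KDE implementation. It assumes a NumPy array of (log-transformed) properties with `NaN` marking missing values, finds nearest neighbours in the subspace of properties observed for the query planet, and samples from a Gaussian kernel density placed on the neighbours' target values. The function name `impute_distribution` and the parameters `k`, `bandwidth`, and `n_samples` are hypothetical, and requiring donors to share all of the query's observed properties is a simplification.

```python
import numpy as np

def impute_distribution(data, row, target, k=5, bandwidth=0.1, n_samples=1000, seed=0):
    """Sample plausible values for the missing entry data[row, target].

    Simplified stand-in for a kNN + kernel density imputer (assumption:
    this is NOT the paper's algorithm, only an illustration of the idea).
    """
    rng = np.random.default_rng(seed)
    query = data[row]

    # Properties observed for the query planet, excluding the target column.
    observed = ~np.isnan(query)
    observed[target] = False

    # Donors: other planets with a known target value and with every
    # property that the query planet has observed (a simplification).
    candidates = ~np.isnan(data[:, target]) & ~np.isnan(data[:, observed]).any(axis=1)
    candidates[row] = False
    donors = data[candidates]

    # Euclidean distance measured only in the observed subspace.
    dist = np.linalg.norm(donors[:, observed] - query[observed], axis=1)
    nearest_values = donors[np.argsort(dist)[:k], target]

    # Sampling from a Gaussian KDE: pick a neighbour value, add kernel noise.
    centers = rng.choice(nearest_values, size=n_samples)
    return centers + rng.normal(0.0, bandwidth, size=n_samples)
```

The returned samples can be summarized with a mean or median for a point estimate, while a histogram of the samples shows whether the imputed distribution is narrow, broad, or multimodal.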

Paper Structure

This paper contains 42 sections, 14 figures, 1 table.

Figures (14)

  • Figure 1: Pairplot for eight planet properties. Grey dots and histogram bars denote the full NASA Exoplanet Archive, while black dots represent the subset of planets used in the complete six properties dataset of TLG2020. Five variables (planet radius, planet mass, planet orbital period, planet equilibrium temperature, and stellar mass) have been log-transformed and are used as such in this study.
  • Figure 2: Pearson correlation coefficients for the eight chosen planet properties in the extended dataset. These values quantify the direction and the magnitude of pair-wise linear relationships between the exoplanet properties. Note that there exists no systematic way of interpreting Pearson correlation coefficients.
  • Figure 3: Test results when using the complete properties dataset where the 150 test planets are treated as transit observations, with missing mass values. The left-hand plot shows four proposed imputation algorithms alongside the mBM code in TLG2020. The right-hand plot shows the comparison between the observed mass and imputed mass for the mBM code and the $k$NN$\times$KDE algorithm. The figure legend shows the average error across all 150 plotted planets. The diagonal dashed line marks a perfect correspondence between the observed and imputed values. The distributions of the three planets marked in the right-hand legend are shown below.
  • Figure 4: Distributions of the imputed mass values calculated with the $k$NN$\times$KDE for the three planets highlighted in Figure 3, selected for their low error (top) and high errors (middle and bottom). The red histogram shows the distribution calculated by the $k$NN$\times$KDE, and the gray histogram is the distribution from the mBM in TLG2020. The vertical black line is the observed mass for the planet, while the red and gray vertical lines show the imputed value from the $k$NN$\times$KDE and the mBM, respectively. The top panel shows the mass distribution for HAT-P-57b: a hot Jupiter with a low error for the imputed mass. The middle and lower panels show the mass distributions for Kepler-30c and Kepler-9c, both of which have higher errors due to being unusual planets within the dataset.
  • Figure 5: Test results when using the complete properties dataset where the 150 test planets are treated as radial velocity observations, with missing radius values and a minimum mass measurement. The left-hand plot shows the results for the mass imputation, after the imputed distribution has been weighted by the minimum mass. Red dots show the $k$NN$\times$KDE algorithm results, while the light grey distribution is from the mBM code in TLG2020. The right-hand plot shows the comparison between the observed radius and imputed radius values that correspond to the weighted imputed masses. The imputed mass and radius distributions for the three highlighted planets are shown in the next figure.
  • ...and 9 more figures
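
The captions of Figures 1 and 2 describe log-transformed properties and pair-wise Pearson correlation coefficients on a dataset with missing values. As a minimal sketch of how such a matrix can be computed, the block below uses pairwise deletion: each coefficient is calculated from the rows where both properties are present. Whether the paper uses pairwise deletion is an assumption, and the function name `pairwise_pearson` is hypothetical. Any log-transform is expected to be applied to the relevant columns before calling it.

```python
import numpy as np

def pairwise_pearson(table):
    """Pearson r for every column pair, using only rows where both values exist.

    Assumption: pairwise deletion of missing values (NaN); the source does not
    state which convention was used for Figure 2.
    """
    n_cols = table.shape[1]
    corr = np.full((n_cols, n_cols), np.nan)
    for i in range(n_cols):
        for j in range(i, n_cols):
            both = ~np.isnan(table[:, i]) & ~np.isnan(table[:, j])
            if both.sum() < 2:
                continue  # not enough shared observations
            x, y = table[both, i], table[both, j]
            if x.std() == 0 or y.std() == 0:
                continue  # correlation undefined for a constant column
            corr[i, j] = corr[j, i] = np.corrcoef(x, y)[0, 1]
    return corr
```

Entries left as `NaN` mark pairs with too few shared observations. Because each coefficient may rest on a different subset of planets, the coefficients are not directly comparable in sample size.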