Detecting False Positives With Derived Planetary Parameters: Experimenting with the KEPLER Dataset

Ayan Bin Rafaih, Zachary Murray

TL;DR

The paper investigates whether derived planetary parameters, rather than full light curves, can effectively identify false positives in Kepler transit data. By evaluating Logistic Regression, Random Forest, SVM, and CNNs on a 9-feature derived parameter set, the study finds that RF and CNNs nearly match the information content of the light curves, achieving up to approximately 92% validation accuracy and strong PR-F1 performance. The results show that simple models can miss subtleties, while CNNs offer the best overall performance, though with higher variability, and that the approach excels particularly for stellar eclipse-related false positives. This lightweight, parameter-focused strategy enables fast, scalable vetting suitable for large datasets and potential application to future missions like TESS.

Abstract

Recent advances in computational power and machine learning techniques motivate their use in many areas of astrophysical research. Consequently, many machine learning models have been trained to classify exoplanet transit signals, typically using time-series light curves. In this work, we take a different approach and try to improve the efficiency of these algorithms by fitting only derived planetary parameters instead of full time-series light curves. We evaluate four models (Logistic Regression, Random Forest, Support Vector Machines, and Convolutional Neural Networks) on the KEPLER dataset, using the precision-recall trade-off and accuracy as metrics. We show that this approach can identify up to about 90% of false positives, implying that the planetary parameters capture most of the relevant information contained in a light curve. Random Forest and Convolutional Neural Networks produce the highest accuracy and the best precision-recall trade-off. We also note that accuracy is highest for false positives associated with the stellar eclipse flag SS.

Paper Structure

This paper contains 11 sections, 5 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Distribution of $\sigma$ (koi_score) across the categories provided in the disposition. The number of false positives stays the same, with high confidence in their dispositions. Most of the candidates, on the other hand, either preserve their vetting status or become confirmed exoplanets through dispositional vetting methods employed outside of the Kepler data. A very thin spread of candidates extends across the entire range of koi_score, but it is insignificant. For example, 2,736 candidates changed their vetting status to "Confirmed", while most of the remaining 1,981 candidates have a koi_score $\sigma > 0.80$.
  • Figure 2: Top: Correlation matrix of the features. Bottom: Cumulative variance explained by principal components. (Top) Adjacent to the diagonal of perfect correlations, a concentrated region shows relatively significant correlations among three of the features: transit depth, transit signal-to-noise ratio, and the odd-even depth comparison statistic. Along the upper and left edges, there is a region of relatively weak correlation with the target disposition feature. (Bottom) The cumulative variance plot shows the total variance explained by the first $N$ principal components. Since the curve does not plateau until the 8th component, each component contributes some variance to the dataset and none can be fully discarded.
  • Figure 3: Pairplots for the training set with logistic regression ground-truth data points. The plots are color-coded as follows: red points represent false positives and blue points represent candidates. Some of the subplots (koi_model_snr vs koi_depth, koi_duration vs koi_model_snr) show a clear separation between the two distributions, since the logistic regression model is able to separate the features effectively and cluster the training predictions in separate regions. For example, for koi_model_snr vs koi_depth, the model classifies signals with a higher transit depth and a higher odd-even comparison statistic as false positives, as the two parameters are intrinsically linked to the same transit properties: the transit signal-to-noise ratio is represented as a standard value, calculated by taking the average of the mean flux measurements. Note that we show koi_sma only for its direct relationship to koi_period through Kepler's third law; because it is a degenerate parameter, it is not included in the actual training set. We also see a very narrow, concentrated distribution of the stellar radii in the koi_srad graph.
  • Figure 4: We apply a standard logit transformation, defined by $\log{(\frac{p}{1-p})}$, to the y-axis of each panel to help linearise the model's decision function. The panels show the spread and distribution of the continuous probability predictions from the Logistic Regression model across the parameter space. The predictions come from 100 bootstrap iterations, each the same size as the training set and each involving an individual logistic fit. The logistic plots show a varied, non-linear distribution of the predictions across the parameter space, which is more consistent with the nature of the features and their relative correlations.
  • Figure 5: Metrics plotted against the threshold that differentiates false positive transit signals from candidate transit signals. Two dotted lines mark thresholds of note. The first sits at the conventional 0.50, where the model seems to sacrifice precision for a greater accuracy score; at this threshold the $F_{1}$-score lies above the accuracy and precision, and the recall is at a local maximum. The second dotted line presents a first case for an optimal threshold: the algebraic case defined in the logistic regression section, where precision equals recall. It therefore represents a possible intersection point for the precision, recall and $F_{1}$-score metrics. The accuracy metric, by its nature, does not pass through that common point; a balance between all four metrics is not possible, either algebraically or practically.
  • ...and 3 more figures
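
The precision-equals-recall threshold described in the Figure 5 caption can be illustrated with a minimal sketch. When precision $P$ equals recall $R$, the $F_{1}$-score $2PR/(P+R)$ reduces to the same value, which is why the three curves can share a point; accuracy also depends on the true negatives and is computed separately. The code below is not the authors' implementation: the labels, scores, threshold grid, and function names are all hypothetical, and it assumes label 1 marks a false positive and label 0 a candidate.

```python
from typing import Dict, List, Optional, Sequence

# Toy, made-up data for illustration only (not taken from the paper).
# Assumed convention: label 1 = false positive, label 0 = candidate.
Y_TRUE: List[int] = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
SCORES: List[float] = [0.03, 0.05, 0.07, 0.15, 0.25, 0.35,
                       0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
THRESHOLDS: List[float] = [i / 10 for i in range(1, 10)]  # 0.1 ... 0.9


def metrics_at(y_true: Sequence[int], scores: Sequence[float],
               threshold: float) -> Dict[str, float]:
    """Compute precision, recall, F1 and accuracy at one threshold.

    A score >= threshold is predicted as class 1.
    """
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"tp": tp, "precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def balanced_threshold(y_true: Sequence[int], scores: Sequence[float],
                       thresholds: Sequence[float]) -> Optional[float]:
    """Return the grid threshold where |precision - recall| is smallest."""
    best: Optional[float] = None
    best_gap: Optional[float] = None
    for t in thresholds:
        m = metrics_at(y_true, scores, t)
        if m["tp"] == 0:
            # precision = recall = 0 is a trivial match; skip it.
            continue
        gap = abs(m["precision"] - m["recall"])
        if best_gap is None or gap < best_gap:
            best, best_gap = t, gap
    return best


if __name__ == "__main__":
    t_star = balanced_threshold(Y_TRUE, SCORES, THRESHOLDS)
    print("balanced threshold:", t_star)
    print(metrics_at(Y_TRUE, SCORES, t_star))
```

On a finite grid the two curves rarely cross exactly, so the sketch returns the nearest grid point. With real model outputs, a finer grid or interpolation between neighbouring thresholds would be a reasonable refinement.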