Redshift Classification of Optical Gamma-Ray Bursts using Supervised Learning

Milind Sarkar, Maria Giovanna Dainotti, Nikita S. Khatiya, Dhruv S. Bal, Malgorzata Bogdan, Ye Li, Agnieszka Pollo, Dieter H. Hartmann, Bing Zhang, Simanta Deka, Nissim Fraija, J. Xavier Prochaska

TL;DR

This study develops an optical plateau–based ensemble learning framework to classify GRBs by redshift, addressing spectroscopic incompleteness with rapid probabilistic predictions. The authors curate a dataset of 171 LGRBs with optical plateau measurements, applying rigorous preprocessing (M-estimator outlier removal, MICE imputation) and LASSO feature selection, then train a SuperLearner ensemble across multiple redshift thresholds. The best-performing model (raw data with M-estimator at $z_t=2.0$) achieves high discriminative power (AUC ≈ 0.841; TPR ≈ 0.741) and generalizes well to independent samples (accuracy ≈ 97%, AUC ≈ 0.9338), with a publicly available web app for real-time use. The optical classifier complements X-ray approaches, offering enhanced sensitivity to high-$z$ events while remaining robust to data incompleteness, and sets the stage for multi-wavelength redshift estimation and improved GRB cosmology.

Abstract

Gamma-ray bursts (GRBs) are among the most luminous explosions in the Universe and serve as powerful probes of the early cosmos. However, the rapid fading of their afterglows and the scarcity of spectroscopic measurements make photometric classification crucial for timely high-redshift identification. We present an ensemble machine learning framework for redshift classification of GRBs based solely on their optical plateau and prompt emission properties. Our dataset comprises 171 long GRBs observed by the Swift UVOT and more than 450 ground-based telescopes. The analysis pipeline integrates robust statistical techniques, including M-estimator outlier rejection, multivariate imputation using Multiple Imputation by Chained Equations (MICE), and Least Absolute Shrinkage and Selection Operator (LASSO) feature selection, followed by a SuperLearner ensemble combining parametric, semi-parametric, and non-parametric algorithms. The optimal model, trained on raw optical data with outlier removal at a redshift threshold of $z = 2.0$, achieves a true positive rate of 74% and an area under the curve (AUC) of 0.84, maintaining balanced generalization between training and test sets. At higher thresholds, such as $z = 3.0$, the classifier sustains strong discriminative power with an AUC of 0.88. Validation on an independent GRB sample yields 97% overall accuracy, perfect specificity, and an ensemble AUC of 0.93. Compared to previous prompt- and X-ray-based classifiers, our optical framework offers enhanced sensitivity to high-redshift events, improved robustness against data incompleteness, and greater applicability to ground-based follow-up. We also publicly release a web application that enables real-time redshift classification, facilitating rapid identification of candidate high-redshift GRBs for cosmological studies.
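
To make the reported metrics concrete, the following minimal Python sketch shows how a redshift threshold turns the problem into binary classification, and how the true positive rate, specificity, and AUC can be computed from predicted probabilities. The function names, the convention that a redshift at or above the threshold counts as high redshift, the 0.5 probability cut, and the toy numbers are illustrative assumptions, not code or values from the paper.

```python
from typing import List, Sequence, Tuple


def label_high_z(redshifts: Sequence[float], z_t: float = 2.0) -> List[int]:
    """Binary labels: 1 for high redshift (z >= z_t, an assumed convention), else 0."""
    return [1 if z >= z_t else 0 for z in redshifts]


def tpr_and_specificity(
    y_true: Sequence[int], scores: Sequence[float], cut: float = 0.5
) -> Tuple[float, float]:
    """True positive rate and specificity when scores >= cut are predicted high-z."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= cut else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (high-z, low-z) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy redshifts and hypothetical classifier probabilities (not real GRB data).
toy_z = [0.5, 1.2, 2.0, 3.4, 1.9, 4.1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]
toy_labels = label_high_z(toy_z, z_t=2.0)
toy_tpr, toy_spec = tpr_and_specificity(toy_labels, toy_scores)
toy_auc = auc(toy_labels, toy_scores)
```

The pairwise form of the AUC is quadratic in the sample size, which is acceptable for a sample of a few hundred GRBs and keeps the definition explicit.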

Paper Structure

This paper contains 40 sections, 2 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: Flowchart detailing each pipeline step, from the initial data to the SuperLearner ensemble model. Yellow boxes show the data engineering technique applied in the pipeline. Green diamonds display the total number of GRBs after each step. Violet boxes and red boxes show the number of GRBs in the training and test sets, respectively. Green boxes highlight the steps involved in the model construction. The orange box represents the outcome of the constructed model, both applied on the test and training sets.
  • Figure 2: Distribution of weights assigned by the M-estimator to each GRB on the raw data. The red vertical line shows the cutoff line of 0.65 for the outliers. GRBs below this cutoff line are considered outliers.
  • Figure 3: Scatter matrix plot for the raw data showing the outliers determined by the M-estimator in cyan and the rest of the data in red.
  • Figure 4: Scatter matrix plot for the MICE-imputed data showing the MICE-imputed data in red and the original data in cyan.
  • Figure 5: The distribution of missing data in our sample. Red boxes highlight GRBs with missing data points, while blue boxes indicate GRBs with complete data for a given variable, as noted on the top axis. The bottom axis enumerates the number of missing variables according to the GRB number shown on the left axis. The left axis represents the count of observations with missing data for specific features. For instance, there are 93 GRBs with complete data, 33 GRBs missing only $\log(\rm{NH})$ values, 2 GRBs missing $\log(PeakFlux_{\rm{err}})$ values, and so on. The right axis indicates the number of features with missing data for each row.
  • ...and 13 more figures
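
The weight cutoff described in the Figure 2 caption can be illustrated with a minimal sketch. It uses a univariate Huber location M-estimator as a stand-in, since the exact M-estimator model and tuning used by the authors are not given in this excerpt; only the 0.65 cutoff comes from the caption. The function names, the Huber constant 1.345, and the toy values are assumptions.

```python
import statistics
from typing import List, Sequence


def huber_weights(
    values: Sequence[float], k: float = 1.345, n_iter: int = 50, tol: float = 1e-9
) -> List[float]:
    """Huber weights from an iteratively reweighted robust location estimate."""
    mu = statistics.median(values)
    # Scale is fixed at the normalized median absolute deviation (MAD).
    scale = 1.4826 * statistics.median(abs(v - mu) for v in values)
    if scale == 0:
        return [1.0] * len(values)

    def weights_at(center: float) -> List[float]:
        # Weight 1 inside k * scale of the center, shrinking as 1/|residual| outside.
        return [
            1.0 if abs(v - center) <= k * scale else k * scale / abs(v - center)
            for v in values
        ]

    for _ in range(n_iter):
        w = weights_at(mu)
        new_mu = sum(wi * v for wi, v in zip(w, values)) / sum(w)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return weights_at(mu)


def flag_outliers(values: Sequence[float], cutoff: float = 0.65) -> List[int]:
    """Indices whose weight falls below the cutoff (0.65 in the Figure 2 caption)."""
    return [i for i, w in enumerate(huber_weights(values)) if w < cutoff]


# Toy values with one obvious outlier (not real GRB measurements).
toy_values = [1.0, 1.1, 0.9, 1.05, 0.95, 5.0]
toy_weights = huber_weights(toy_values)
toy_outliers = flag_outliers(toy_values)
```

In a regression setting the same weights would be computed from model residuals rather than from deviations about a single location, but the cutoff logic is unchanged.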