Revealing Hidden Repeaters in the CHIME/FRB Catalog: Semi-Supervised Insights into the Fast Radio Burst Population

N. Mankatwit, P. Thongkonsing, S. Loekkesee, P. Chainakun, W. Luangtip, S. Sanpa-arsa

Abstract

Fast radio bursts (FRBs) are millisecond-duration extragalactic transients, observationally classified as repeaters or non-repeaters. This classification may be biased, as some apparently non-repeating sources could simply have undetected subsequent bursts. To address this, we develop a semi-supervised learning framework to identify distinguishing features of repeaters using primary observational parameters from the Blinkverse database, which draws from the CHIME/FRB Catalogs. The framework combines labeled data (known repeaters and confidently classified non-repeaters) with unlabeled sources previously flagged as non-repeaters but exhibiting repeater-like characteristics. We employ uniform manifold approximation and projection (UMAP) with a nearest-neighbor scheme to select potential candidates, followed by semi-supervised classification using five base estimators: random forest, support vector machine, logistic regression, AdaBoost, and gradient boosting. Each model is fine-tuned through cross-validation, and a voting strategy among the five models is employed to enhance robustness. All models achieve consistently high performance, identifying dispersion measure, peak frequency, and fluence as the most discriminative features. Repeaters tend to show lower dispersion measures, higher peak frequencies, and higher fluences than non-repeaters. We also identify a set of candidate repeaters, several of which are consistent with prior independent studies. Our approach identifies 36 additional repeater candidates that conventional methods may have missed. Finally, the results highlight dispersion measure as a key discriminator between repeaters and non-repeaters, revealing a tension between physical and instrumental origins: either environmental effects, if the two populations arise from distinct progenitors, or detection bias, as nearby sources are more easily observed.
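The voting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the rule given in the Figure 2 caption (a source counts as a repeater candidate when at least three of the five base estimators predict Class 1). The source names and per-model predictions below are made up for illustration and are not taken from the paper.

```python
# Minimal sketch of majority voting across five classifiers.
# All source names and predictions here are hypothetical.

MIN_VOTES = 3  # "at least three out of five" per the Figure 2 caption


def vote_candidates(predictions, min_votes=MIN_VOTES):
    """Return sources predicted as repeaters (label 1) by >= min_votes models.

    predictions maps model name -> {source name -> predicted label (0 or 1)}.
    """
    votes = {}
    for model_preds in predictions.values():
        for source, label in model_preds.items():
            votes[source] = votes.get(source, 0) + int(label == 1)
    return sorted(source for source, n in votes.items() if n >= min_votes)


# Hypothetical predictions from the five base estimators.
predictions = {
    "RF":  {"FRB_A": 1, "FRB_B": 0, "FRB_C": 1},
    "SVC": {"FRB_A": 1, "FRB_B": 1, "FRB_C": 0},
    "LR":  {"FRB_A": 1, "FRB_B": 0, "FRB_C": 1},
    "Ada": {"FRB_A": 0, "FRB_B": 0, "FRB_C": 1},
    "GB":  {"FRB_A": 1, "FRB_B": 1, "FRB_C": 0},
}

candidates = vote_candidates(predictions)
print(candidates)
```

In this toy input, FRB_A receives four votes and FRB_C three, so both pass the cut, while FRB_B with two votes does not.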

Paper Structure

This paper contains 12 sections, 3 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Box plots display the distribution of each feature for non-repeating FRBs (class 0) and repeating FRBs (class 1). The box shows the interquartile range (IQR), spanning from the first quartile (Q1) to the third quartile (Q3), with a central line marking the median. Whiskers extend to values within 1.5 times the IQR from the quartiles. Points beyond the whiskers are identified as outliers. Note that these data are plotted directly from the original dataset and do not yet account for the possibility that some non-repeaters may be hidden repeaters.
  • Figure 2: Workflow of the semi-supervised learning pipeline using self-training. The initial dataset comprises two classes: Class 0 (non-repeaters) and Class 1 (repeaters). To identify ambiguous non-repeating sources that exhibit repeater-like characteristics, a UMAP-based nearest neighbors approach is applied. These selected Class 0 instances are reassigned to an unlabeled category (Class -1), resulting in a dataset with three classes: 0, 1, and -1. The data is then split into 80% for training and 20% for testing. The labeled portion of the training set (Classes 0 and 1) is used to train five separate base classifiers with five features, each within its own self-training algorithm (blue boxes). Each trained model predicts class probabilities for the unlabeled data. Instances with a predicted probability $\geq 0.8$ are assigned pseudo-labels and merged with the labeled set for retraining, while those below the threshold remain unlabeled (blue arrows). This iterative self-training process continues until either all unlabeled samples are assigned or a maximum iteration limit is reached. The final model from each self-training loop is evaluated on the held-out test set to measure performance and to detect raw repeater candidates (red arrows). A voting process is then applied, wherein a source is recognized as a repeater candidate if at least three out of five base estimators predict it as such. See text for more details.
  • Figure 3: UMAP visualization of the dataset. The top panel displays the original label distribution, with repeaters (Class 1) shown in green and non-repeaters (Class 0) in red. The bottom panel shows the updated labels after neighborhood-based relabeling, where some Class 0 instances exhibiting repeater-like characteristics are reassigned to the unlabeled category (Class -1), shown in gray.
  • Figure 4: UMAP visualization highlighting our repeater candidates. Red circles, green circles, and blue triangles represent non-repeaters, known repeaters, and repeater candidates identified in this study, respectively.
  • Figure 5: SHAP feature importance visualizations at the 1000th iteration for five different classifiers: RF, SVC, LR, Ada, and GB. Each point represents a data instance, colored by the feature value (low to high). The x-axis shows the SHAP value, which quantifies the impact of that feature on the model's prediction for a given sample. A high positive SHAP value indicates a strong contribution toward predicting a repeater (Class 1), while a large negative SHAP value reflects a strong influence toward predicting a non-repeater (Class 0).
  • ...and 4 more figures
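
The self-training loop described in the Figure 2 caption can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the base estimator is a nearest-centroid stand-in rather than any of the five classifiers in the paper, the two-dimensional data points are invented, and the `max_iter` default is arbitrary. Only the 0.8 probability threshold and the 0 / 1 / -1 class convention come from the caption.

```python
import math

THRESHOLD = 0.8  # pseudo-label confidence cut quoted in the Figure 2 caption
UNLABELED = -1   # Class -1 in the caption


def fit_centroids(X, y):
    """Toy base estimator (an assumption): one mean vector per labeled class.

    Assumes both classes 0 and 1 have at least one labeled sample.
    """
    centroids = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_proba(centroids, x):
    """P(class 1) from a softmax over negative distances to the two centroids."""
    d0 = math.dist(x, centroids[0])
    d1 = math.dist(x, centroids[1])
    return 1.0 / (1.0 + math.exp(d1 - d0))


def self_train(X, y, max_iter=10):
    """Pseudo-label confident unlabeled samples, then refit, until nothing changes."""
    y = list(y)
    for _ in range(max_iter):
        centroids = fit_centroids(X, y)  # train on labeled + pseudo-labeled data
        changed = False
        for i, x in enumerate(X):
            if y[i] != UNLABELED:
                continue
            p1 = predict_proba(centroids, x)
            if p1 >= THRESHOLD:
                y[i], changed = 1, True
            elif 1.0 - p1 >= THRESHOLD:
                y[i], changed = 0, True
            # otherwise the sample stays unlabeled for the next round
        if not changed or UNLABELED not in y:
            break
    return fit_centroids(X, y), y


# Invented 2-D points: four labeled samples followed by three unlabeled ones.
X = [(0, 0), (1, 0), (10, 10), (11, 10), (9, 9), (0.5, 1), (5.5, 5)]
y = [0, 0, 1, 1, UNLABELED, UNLABELED, UNLABELED]

model, final_labels = self_train(X, y)
print(final_labels)
```

In this toy run the two unlabeled points near a cluster receive pseudo-labels in the first pass, while the point midway between the clusters never reaches the 0.8 threshold and stays unlabeled. scikit-learn provides a `SelfTrainingClassifier` with a `threshold` parameter that follows the same idea; whether the authors used it is not stated in this summary.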