A search for new symbiotic stars in the Milky Way: Using machine learning techniques applied to photometric databases

V. Contreras Rojas, M. Jaque Arancibia, C. E. Ferreira Lopes, N. Monsalves, R. Angeloni, G. J. M. Luna, V. Marels, D. Concha, N. E. Nunez, C. Saffe, M. Flores

TL;DR

Symbiotic stars (SySts) are rare yet informative interacting binaries, and their Galactic census remains incomplete. The authors build a supervised machine-learning pipeline that combines Gaia DR3, 2MASS, and WISE photometry with parallaxes and the pseudo-equivalent width of Hα to target S-type SySts. A Random Forest is trained on 166 confirmed S-type systems and 1,600 non-symbiotic stars, with SMOTE used to mitigate the class imbalance. Applied to roughly 2.5 million color-selected sources, the model flags 990 candidates with probabilities above 70%. Physically motivated cuts then narrow these to 12 high-confidence objects whose properties, including UV excess, are consistent with S-type SySts. Independent validation on recently confirmed systems recovers 92.3% of known S-types, which supports the robustness and generalizability of the approach and its potential to refine the Galactic SySt census through follow-up spectroscopy.

Abstract

Symbiotic stars (SySts) are interacting binaries composed of a red giant transferring material to a hot compact star, typically a white dwarf. Although only about 300 systems are confirmed, the Galactic population is estimated at 1.2 × 10^3 to 1.5 × 10^4, indicating that most remain undiscovered. We identify new SySts using a machine-learning approach that combines Gaia DR3, 2MASS, and WISE photometry, parallaxes, and the pseudo-equivalent width of Hα. A Random Forest model was trained on 166 confirmed S-type SySts and 1600 non-symbiotic stars, applying SMOTE to mitigate class imbalance. The model achieved an F1-score of 89% for the symbiotic class. Applied to 2.5 × 10^6 color-selected sources, it identified 990 candidates with probabilities above 70%. We further refined the sample using physically motivated cuts on effective temperature, surface gravity, metallicity, and SkyMapper photometry, yielding 12 high-confidence candidates. These objects show cool temperatures, low surface gravities, near-solar metallicity, Hα emission, moderate-to-high luminosities, and UV excess consistent with S-type SySts. Validation on recently confirmed systems recovered 92.3% of them, demonstrating the robustness and generalizability of our method.
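
The abstract relies on SMOTE to offset the imbalance between 166 symbiotic stars and 1600 other stars. The sketch below illustrates the core idea of SMOTE only: each synthetic minority point is placed on the segment between a real minority sample and one of its nearest minority neighbours. It is a minimal standard-library illustration, not the authors' code. The function name `smote`, the neighbour count `k`, the toy two-feature vectors, and the choice to oversample until both classes have equal size are all assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration; a real analysis would more likely
    use an established implementation such as the one in imbalanced-learn.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` inside the minority class only
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Toy, made-up feature vectors (for example two colours); not data from the paper.
minority = [[1.0, 2.0], [1.2, 2.1], [0.9, 1.8], [1.1, 2.3]]
majority_size = 12  # assumed size of the majority class in this toy case

new_points = smote(minority, n_new=majority_size - len(minority), k=3)
balanced_minority = minority + new_points
print(len(minority), "real +", len(new_points), "synthetic minority samples")
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already occupied by that class. Oversampling of this kind is normally applied to the training split only, so that test metrics such as the quoted F1-score are measured on real objects.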

Paper Structure (17 sections, 2 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Average number of valid measurements per photometric group across the sample. Each bar represents the mean number of available (non-missing) values for the bands belonging to a given instrument or spectral range.
  • Figure 2: Color–color selection process. Left: 2D histogram of 17 988 392 sources obtained from four ADQL queries to the Gaia services and a WISE crossmatch. The first selection cut was applied in the Gaia color–color diagram using a linear regression in logarithmic scale, resulting in the removal of 143 434 sources. Confirmed SySts are marked as orange stars. Right: 2D histogram of over four million sources selected from the 2MASS color–color diagram. The color bar indicates the number of sources per bin in logarithmic scale for both panels.
  • Figure 3: Distribution of training set characteristics for the positive class (S-type SySts) and the negative class ('other').
  • Figure 4: Confusion matrix for the testing set using SMOTE + RF, incorporating photometric colors, parallax, and ${\rm EW}_{\rm H\alpha}$. The x-axis represents the predicted class (predicted label), while the y-axis denotes the actual class (true label). Each cell shows the percentage relative to its respective class on the first line, followed by the corresponding number of stars on the second line.
  • Figure 5: Comparison between the Random Forest mean decrease in impurity (MDI, orange bars) and the permutation feature importance computed using the F1 score (black dots).
  • ...and 4 more figures
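
The quantities named in the captions of Figures 4 and 5 can be illustrated with a small standard-library sketch: the F1 score of the positive class, a confusion matrix in which each cell carries the percentage relative to its true class together with the raw count, and a permutation importance measured as the mean drop in F1 after shuffling one feature. Everything here is assumed for illustration: the function names, the toy data, and the threshold rule `predict`, which stands in for a trained classifier. MDI is specific to tree ensembles and is not computed in this sketch.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def confusion_rows(y_true, y_pred, labels=(0, 1)):
    """Map each true label to [(percent of that class, count), ...]."""
    table = {}
    for t in labels:
        counts = [sum(yt == t and yp == p for yt, yp in zip(y_true, y_pred))
                  for p in labels]
        total = sum(counts)
        table[t] = [(100.0 * c / total if total else 0.0, c) for c in counts]
    return table


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in F1 when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = f1_score(y, [predict(row) for row in X])
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total += baseline - f1_score(y, [predict(r) for r in shuffled])
        drops.append(total / n_repeats)
    return baseline, drops


def predict(row):
    # Hypothetical stand-in for a trained model: uses only feature 0.
    return 1 if row[0] > 0.5 else 0


# Toy, made-up data; label 1 plays the role of the symbiotic class.
X = [[0.9, 0.1], [0.8, 0.7], [0.7, 0.3], [0.2, 0.9],
     [0.1, 0.4], [0.3, 0.6], [0.4, 0.2], [0.6, 0.8]]
y = [1, 1, 1, 0, 0, 0, 0, 0]

y_pred = [predict(row) for row in X]
rows = confusion_rows(y, y_pred)
baseline, drops = permutation_importance(predict, X, y)
print(rows)
print(baseline, drops)
```

In this toy case the second feature is ignored by `predict`, so shuffling it leaves F1 unchanged and its importance is zero, while shuffling the first feature lowers F1. Scoring the permutations with F1 rather than accuracy keeps the measure sensitive to the rare positive class.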