Does Machine Learning Work? A Comparative Analysis of Strong Gravitational Lens Searches in the Dark Energy Survey

J. Gonzalez, T. Collett, K. Rojas, K. Bechtol, J. A. Acevedo Barroso, A. Melo, A. More, D. Sluse, C. Tortora, P. Holloway, N. E. P. Lines, A. Verma

TL;DR

The paper compares three independent ML pipelines for strong lens searches in the DES and assesses their performance using expert visual classifications on a common set of 1651 SLED candidates. Gonzalez's Vision Transformer with interactive learning achieves the highest $F_{1}$-score ($F_{1}=0.54$) and AUROC ($=0.85$), while Jacobs and Rojas trail with $F_{1}$-scores of 0.31 and 0.35, respectively. Ensemble strategies (e.g., decision tree, random forest, Independent Bayesian) raise the maximum $F_{1}$ to about 0.67 and can boost precision at fixed completeness, underscoring the value of combining diverse classifiers. The results demonstrate that diverse ML ensembles can increase lens completeness while suppressing false positives, providing actionable guidance for scalable lens searches in current and upcoming wide-field surveys.
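
For reference, the $F_{1}$-score is conventionally the harmonic mean of precision and completeness (recall). This summary does not spell out the paper's own definitions, so the standard ones are assumed here, with $TP$, $FP$, and $FN$ (symbols introduced for illustration) counting true positives, false positives, and false negatives relative to the expert classifications at a given score threshold:

$$
P = \frac{TP}{TP + FP}, \qquad
C = \frac{TP}{TP + FN}, \qquad
F_{1} = \frac{2\,P\,C}{P + C}
$$

Under this reading, a "maximum $F_{1}$" is presumably the largest $F_{1}$ obtained as the score threshold is varied.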

Abstract

We present a systematic comparison of three independent machine learning (ML)-based searches for strong gravitational lenses applied to the Dark Energy Survey (Jacobs et al. 2019a,b; Rojas et al. 2022; Gonzalez et al. 2025). Each search employs a distinct ML architecture and training strategy, allowing us to evaluate their relative performance, completeness, and complementarity. Using a visually inspected sample of 1651 systems previously reported as lens candidates, we assess how each model scores these systems and quantify their agreement with expert classifications. The three models show progressive improvement in performance, with F1-scores of 0.31, 0.35, and 0.54 for Jacobs, Rojas, and Gonzalez, respectively. Their completeness for moderate- to high-confidence lens candidates follows a similar trend (31%, 52%, and 70%). When combined, the models recover 82% of all such systems, highlighting their strong complementarity. Additionally, we explore ensemble strategies: average, median, linear regression, decision trees, random forests, and an Independent Bayesian method. We find that all but averaging achieve higher maximum F1 scores than the best individual model, with some ensemble methods improving precision by up to a factor of six. These results demonstrate that combining multiple, diverse ML classifiers can substantially improve the completeness of lens samples while drastically reducing false positives, offering practical guidance for optimizing future ML-based strong lens searches in wide-field surveys.
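
The two simplest ensemble strategies named in the abstract (average and median) and the combined recovery of candidates can be illustrated with a short Python sketch. This is not the authors' code. The scores and labels below are made-up toy values, the helper names `max_f1` and `union_completeness` are hypothetical, and scanning every distinct score as a threshold is only one plausible way to obtain a maximum F1.

```python
import numpy as np


def max_f1(scores, labels):
    """Scan every distinct score as a threshold; return (best F1, threshold)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_f1, best_thr = 0.0, None
    for thr in np.unique(scores):
        pred = scores >= thr              # systems selected at this cut
        tp = np.sum(pred & labels)        # true lenses selected
        fp = np.sum(pred & ~labels)       # non-lenses selected
        fn = np.sum(~pred & labels)       # true lenses missed
        if tp == 0:
            continue                      # F1 is zero or undefined here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)           # "completeness" in the text
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_thr = float(f1), float(thr)
    return best_f1, best_thr


def union_completeness(score_matrix, labels, threshold):
    """Fraction of true lenses scored >= threshold by at least one model."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    recovered = (score_matrix >= threshold).any(axis=1)
    return float(np.sum(recovered & labels) / np.sum(labels))


# Toy data (assumption): rows are systems, columns are three models' scores.
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0], dtype=bool)
scores = np.array([
    [0.9, 0.3, 0.8],
    [0.2, 0.9, 0.7],
    [0.8, 0.7, 0.2],
    [0.7, 0.2, 0.3],
    [0.1, 0.6, 0.2],
    [0.3, 0.1, 0.6],
    [0.4, 0.5, 0.9],
    [0.2, 0.3, 0.1],
])

# Average and median ensembles: one combined score per system.
avg_score = scores.mean(axis=1)
med_score = np.median(scores, axis=1)

single_f1, _ = max_f1(scores[:, 0], labels)   # first model alone
avg_f1, _ = max_f1(avg_score, labels)
med_f1, _ = max_f1(med_score, labels)

single_comp = union_completeness(scores[:, :1], labels, 0.8)
combined_comp = union_completeness(scores, labels, 0.8)

print(f"max F1: single={single_f1:.2f} average={avg_f1:.2f} median={med_f1:.2f}")
print(f"completeness at 0.8: single={single_comp:.2f} combined={combined_comp:.2f}")
```

In this toy example each model misses a different true lens, so the combined scores separate the classes better than any single column does. The behavior on real data depends on how correlated the models' errors are, which is what the paper's comparison measures.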

Paper Structure

This paper contains 16 sections, 2 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Distribution of normalized ranks assigned by each search effort for all astronomical targets processed by the three ML models (Intersection sample). The histogram illustrates differences in score distributions.
  • Figure 2: 2D histograms comparing the ML normalized ranks from the three ML models for systems reported as strong lensing candidates in the SLED database. Each row corresponds to a different range of SLED score, which indicates the confidence level of the candidates (0 to 3). This visualization examines how ML models score candidates with varying confidence levels.
  • Figure 3: Screenshot from the visual inspection project on Zooniverse. Strong lensing experts were shown each system in four distinct PNG settings designed to highlight different image features. Experts were asked to categorize each system into four classes: A (certain lens), B (probable lens), C (could be a lens), and Z (not a lens).
  • Figure 4: Histogram of the Expert Scores assigned to systems previously reported in the SLED database as strong lensing candidates. The scores were determined through visual inspection of images from the second DES data release. To reflect the discrete nature of the Expert Scores, each bar corresponds to a unique score value rather than to a continuous histogram bin.
  • Figure 5: Random selection of eight systems from each category (A, B, C, Z). Within each row, systems are ordered from left to right by increasing Expert Score. This figure provides an intuitive visual sense of the typical characteristics within each category.
  • ...and 8 more figures