Does Machine Learning Work? A Comparative Analysis of Strong Gravitational Lens Searches in the Dark Energy Survey

J. Gonzalez; T. Collett; K. Rojas; K. Bechtol; J. A. Acevedo Barroso; A. Melo; A. More; D. Sluse; C. Tortora; P. Holloway; N. E. P. Lines; A. Verma

TL;DR

The paper compares three independent ML pipelines for strong lens searches in the Dark Energy Survey (DES) and assesses their performance against expert visual classifications of a common set of 1651 SLED candidates. The Gonzalez model, a Vision Transformer with interactive learning, achieves the highest $F_{1}$-score ($F_{1}=0.54$) and AUROC ($=0.85$), while the Jacobs and Rojas models trail with $F_{1}$-scores of 0.31 and 0.35, respectively. Ensemble strategies (e.g., decision tree, random forest, Independent Bayesian) raise the maximum $F_{1}$ to about 0.67 and can boost precision at fixed completeness, underscoring the value of combining diverse classifiers. The results indicate that ensembles of diverse ML classifiers can increase lens completeness while suppressing false positives, providing actionable guidance for scalable lens searches in current and upcoming wide-field surveys.

Abstract

We present a systematic comparison of three independent machine learning (ML)-based searches for strong gravitational lenses applied to the Dark Energy Survey (Jacobs et al. 2019a,b; Rojas et al. 2022; Gonzalez et al. 2025). Each search employs a distinct ML architecture and training strategy, allowing us to evaluate their relative performance, completeness, and complementarity. Using a visually inspected sample of 1651 systems previously reported as lens candidates, we assess how each model scores these systems and quantify their agreement with expert classifications. The three models show progressive improvement in performance, with F1-scores of 0.31, 0.35, and 0.54 for Jacobs, Rojas, and Gonzalez, respectively. Their completeness for moderate- to high-confidence lens candidates follows a similar trend (31%, 52%, and 70%). When combined, the models recover 82% of all such systems, highlighting their strong complementarity. Additionally, we explore ensemble strategies: average, median, linear regression, decision trees, random forests, and an Independent Bayesian method. We find that all but averaging achieve higher maximum F1 scores than the best individual model, with some ensemble methods improving precision by up to a factor of six. These results demonstrate that combining multiple, diverse ML classifiers can substantially improve the completeness of lens samples while drastically reducing false positives, offering practical guidance for optimizing future ML-based strong lens searches in wide-field surveys.
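To make the quantities in the abstract concrete, the following is a minimal sketch of how a maximum F1-score, a simple average or median ensemble, and the combined completeness of several models could be computed. It is not the authors' pipeline: the function names, the score threshold of 0.5, and the toy labels and scores are all hypothetical and chosen only for illustration. It assumes each model outputs a score in [0, 1] per candidate and that expert classifications have been reduced to binary labels (1 = lens, 0 = non-lens).

```python
from statistics import mean, median


def f1_score(labels, scores, threshold):
    """F1 for binary labels when candidates with score >= threshold are flagged."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as zero here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the "completeness" of the lens sample
    return 2 * precision * recall / (precision + recall)


def max_f1(labels, scores):
    """Maximum F1 over all thresholds drawn from the observed scores."""
    return max(f1_score(labels, scores, t) for t in set(scores))


def union_completeness(labels, score_lists, threshold):
    """Fraction of true lenses flagged by at least one of the models."""
    n_lenses = sum(labels)
    recovered = sum(
        1
        for i, y in enumerate(labels)
        if y == 1 and any(scores[i] >= threshold for scores in score_lists)
    )
    return recovered / n_lenses


# Hypothetical toy data: 8 candidates, 4 of them labelled as lenses.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
model_a = [0.9, 0.2, 0.8, 0.7, 0.1, 0.3, 0.2, 0.4]
model_b = [0.6, 0.9, 0.3, 0.2, 0.8, 0.1, 0.3, 0.7]
model_c = [0.8, 0.7, 0.9, 0.4, 0.3, 0.6, 0.1, 0.2]
models = [model_a, model_b, model_c]

# Average and median ensembles: combine the per-candidate scores across models.
avg_scores = [mean(col) for col in zip(*models)]
med_scores = [median(col) for col in zip(*models)]

if __name__ == "__main__":
    for name, scores in [("A", model_a), ("B", model_b), ("C", model_c),
                         ("average", avg_scores), ("median", med_scores)]:
        print(f"max F1 ({name}): {max_f1(labels, scores):.2f}")
    print(f"union completeness: {union_completeness(labels, models, 0.5):.2f}")
```

The union completeness illustrates why models with different failure modes are complementary: a lens missed by one model can still be recovered by another, at the cost of also pooling each model's false positives, which is what the score-combining ensembles aim to suppress.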
