BALNet: Deep Learning-Based Detection and Measurement of Broad Absorption Lines in Quasar Spectra

Yangyang Li, Zhijian Luo, Shaohua Zhang, Du Wang, Jianzhen Chen, Zhu Chen, Hubing Xiao, Chenggang Shu

TL;DR

BALNet addresses the scalability challenge of BAL trough detection and velocity measurement in large quasar spectroscopic surveys by combining a 1D‑CNN with Bi‑LSTM to detect BAL troughs and extract their kinematic properties directly from spectra. The model is trained on a large, carefully constructed mock dataset derived from SDSS DR16Q, enabling simultaneous BAL quasar classification and trough velocity estimation. On simulated data, BALNet achieves robust performance (BAL trough detection: ~83% completeness, ~90.7% purity; BAL quasar identification: ~90.8% completeness, ~94.4% purity; velocity metrics with f_out ~9%, σ_NMAD < 0.03, bias < 1e−5) and a high AU‑PRC (~0.92). Applied to 446,839 DR16Q spectra (1.5 ≤ z ≤ 5.7), BALNet identifies 91,164 BAL quasars (20.4% of the sample), including 25,123 newly detected BAL quasars and 8.8% redshifted troughs, demonstrating significant gains in detection efficiency and the ability to map BAL populations across wide velocity ranges. The work also provides public code and catalogs, enabling broader studies of quasar outflows and their evolution.

Abstract

Broad absorption line (BAL) quasars serve as critical probes for understanding active galactic nucleus (AGN) outflows, black hole accretion, and cosmic evolution. To address the limitations of manual classification in large-scale spectroscopic surveys, where the number of quasar spectra is growing exponentially, we propose BALNet, a deep learning approach consisting of a one-dimensional convolutional neural network (1D-CNN) and bidirectional long short-term memory (Bi-LSTM) networks to automatically detect BAL troughs in quasar spectra. BALNet enables both the identification of BAL quasars and the measurement of their BAL troughs. We construct a simulated dataset for training and testing by combining non-BAL quasar spectra and BAL troughs, both derived from SDSS DR16 observations. Experimental results on the testing set show that: (1) BAL trough detection achieves 83.0% completeness, 90.7% purity, and an F1-score of 86.7%; (2) BAL quasar classification achieves 90.8% completeness and 94.4% purity; (3) the predicted BAL velocities agree closely with simulated ground truth labels, confirming BALNet's robustness and accuracy. When applied to the SDSS DR16 data within the redshift range 1.5<z<5.7, at least one BAL trough is detected in 20.4% of spectra. Notably, more than a quarter of these are newly identified sources with significant absorption, 8.8% correspond to redshifted systems, and some narrow/weak absorption features were missed. BALNet greatly improves the efficiency of large-scale BAL trough detection and enables more effective scientific analysis of quasar spectra.
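
The quoted F1-score and sample fractions can be cross-checked from the numbers given in the TL;DR and abstract. The following is a minimal sketch, not part of the BALNet code: the function and variable names are illustrative, and it assumes the usual definition of F1 as the harmonic mean of completeness (recall) and purity (precision).

```python
def f1_score(completeness: float, purity: float) -> float:
    """Harmonic mean of completeness (recall) and purity (precision)."""
    return 2.0 * completeness * purity / (completeness + purity)


# Trough-level completeness and purity quoted in the abstract.
trough_f1 = f1_score(0.830, 0.907)

# Counts quoted in the TL;DR for the DR16Q application.
n_spectra = 446_839  # spectra searched
n_bal = 91_164       # spectra with at least one detected BAL trough
n_new = 25_123       # BAL quasars reported as newly detected

bal_fraction = n_bal / n_spectra  # share of spectra flagged as BAL
new_fraction = n_new / n_bal      # share of flagged quasars that are new

print(f"trough F1    = {trough_f1:.3f}")
print(f"BAL fraction = {bal_fraction:.3f}")
print(f"new fraction = {new_fraction:.3f}")
```

Under these assumptions the values come out consistent with the reported 86.7% F1-score, the 20.4% BAL fraction, and the statement that more than a quarter of the detections are new.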

Paper Structure

This paper contains 14 sections, 8 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Distributions of the number of C IV BAL troughs per spectrum for 23,107 observed BAL quasars (black solid line) and for 100,000 simulated BAL quasars (red dashed line).
  • Figure 2: Flowchart illustrating the procedure for constructing simulated spectra. Panel (a) presents a normalized BAL pattern spectrum (black), with the gray-shaded regions marking three randomly selected BAL troughs. Panel (b) displays a real, unabsorbed spectrum (black) and its smoothed version (blue). Panel (c) shows the final simulated spectrum (black) in the bottom part, and the corresponding label vector (red) in the upper part, where BAL regions are labeled as 1 and non-BAL regions as 0.
  • Figure 3: Schematic of the LSTM unit structure, illustrating the four interacting key components (input gate $i$, forget gate $f$, output gate $o$, and candidate memory cell state $\tilde{C}$) and their information flow. The mathematical expressions for the gates are provided in Equation (3), while the temporal update mechanism of the cell state is defined in Equation (7). The arrows in the figure indicate the direction of information flow, reflecting the dynamic gating logic of LSTM when processing sequential data.
  • Figure 4: Architecture of the proposed BALNet framework, comprising three core modules: (a) The 1D-CNN feature extractor processes the 1165-dimensional input spectrum through convolutional operations (kernel_size=7, stride=3, 192 filters), transforming it into a 387×192 feature matrix, followed by batch normalization, ReLU activation, and dropout (rate=0.2) for feature refinement; (b) The Bi-LSTM module employs two bidirectional LSTM layers (128 hidden units each) to analyze temporal patterns, producing a 387×256 feature matrix. Each LSTM layer is followed by batch normalization and dropout for regularization; (c) The output module generates a 387-dimensional probability vector through a fully connected layer with sigmoid activation, where each element represents the presence probability of BAL troughs at the corresponding spectral position.
  • Figure 5: Precision-recall (PR) curves for the training set (orange) and the testing set (green). The brown point marks the probability threshold at which the model achieves its optimal performance (best F1-score, PIXEL_PROB = 0.3). The AU-PRC serves as a quantitative measure of the model's performance.
  • ...and 6 more figures
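
The tensor shapes quoted in the Figure 4 caption can be reproduced with simple arithmetic. The sketch below is illustrative only and is not taken from the BALNet code: the helper names are hypothetical, and it assumes the convolution uses no padding and no dilation, which the caption does not state but which is the setting that yields 387 output positions from 1165 input pixels.

```python
def conv1d_out_len(n_in: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    """Output length of a 1D convolution without dilation."""
    return (n_in + 2 * padding - kernel_size) // stride + 1


def pixel_window(k: int, kernel_size: int = 7, stride: int = 3) -> tuple[int, int]:
    """Inclusive input-pixel range seen by output position k (assumes no padding)."""
    start = k * stride
    return start, start + kernel_size - 1


n_pixels = 1165   # input spectrum length (Figure 4)
n_filters = 192   # convolution filters (Figure 4)
n_hidden = 128    # hidden units per LSTM direction (Figure 4)

seq_len = conv1d_out_len(n_pixels, kernel_size=7, stride=3)

cnn_shape = (seq_len, n_filters)        # 1D-CNN feature matrix
bilstm_shape = (seq_len, 2 * n_hidden)  # forward and backward states concatenated
output_shape = (seq_len,)               # one BAL probability per position

print(cnn_shape, bilstm_shape, output_shape)
print(pixel_window(0), pixel_window(seq_len - 1))
```

With these assumptions each of the 387 output probabilities corresponds to a 7-pixel window of the input, stepped by 3 pixels, and the last window ends exactly at the final input pixel.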