A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys

Yufeng Luo, Adam D. Myers, Alex Drlica-Wagner, Dario Dematties, Salma Borchani, Francisco Valdes, Arjun Dey, David Schlegel, Rongpu Zhou, DESI Legacy Imaging Surveys Team

TL;DR

A machine-learning-based approach to detect poor-quality exposures in large imaging surveys, with a focus on the DECam Legacy Survey (DECaLS) in regions of low extinction (i.e., $E(B-V)<0.04$).

Abstract

As the data volume of astronomical imaging surveys rapidly increases, traditional methods for image anomaly detection, such as visual inspection by human experts, are becoming impractical. We introduce a machine-learning-based approach to detect poor-quality exposures in large imaging surveys, with a focus on the DECam Legacy Survey (DECaLS) in regions of low extinction (i.e., $E(B-V)<0.04$). Our semi-supervised pipeline integrates a vision transformer (ViT), trained via self-supervised learning (SSL), with a k-Nearest Neighbor (kNN) classifier. We train and validate our pipeline using a small set of labeled exposures observed by surveys with the Dark Energy Camera (DECam). A clustering-space analysis of where our pipeline places images labeled in ``good'' and ``bad'' categories suggests that our approach can efficiently and accurately determine the quality of exposures. Applied to new imaging being reduced for DECaLS Data Release 11, our pipeline identifies 780 problematic exposures, which we subsequently verify through visual inspection. Being highly efficient and adaptable, our method offers a scalable solution for quality control in other large imaging surveys.
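The classification stage described in the abstract can be illustrated with a minimal sketch: a kNN majority vote over embedding vectors. This is not the authors' implementation. The synthetic embeddings below stand in for ViT outputs, and the function name `knn_predict`, the cosine-similarity metric, the embedding dimension, and `k=5` are all assumptions chosen for illustration.

```python
import numpy as np


def knn_predict(train_emb, train_labels, query_emb, k=5):
    """Majority-vote kNN over embeddings (hypothetical helper).

    Cosine similarity is an assumed metric; the source does not state one.
    """
    # L2-normalize so that a dot product equals cosine similarity.
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = query @ train.T  # shape: (n_query, n_train)

    # Indices of the k most similar labeled embeddings for each query.
    nn_idx = np.argsort(-sims, axis=1)[:, :k]
    nn_labels = train_labels[nn_idx]  # shape: (n_query, k)

    # Count votes per class and return the most frequent label.
    classes = np.unique(train_labels)
    votes = (nn_labels[:, :, None] == classes[None, None, :]).sum(axis=1)
    return classes[np.argmax(votes, axis=1)]


# Synthetic stand-ins for ViT embeddings: two well-separated clusters.
rng = np.random.default_rng(0)
dim = 16  # assumed embedding size, for illustration only
good_center = np.zeros(dim)
good_center[0] = 5.0
bad_center = np.zeros(dim)
bad_center[1] = 5.0

train_emb = np.vstack([
    good_center + rng.normal(scale=0.5, size=(50, dim)),
    bad_center + rng.normal(scale=0.5, size=(50, dim)),
])
train_labels = np.array(["good"] * 50 + ["bad"] * 50)

# Two unlabeled "exposures", one placed at each cluster center.
query_emb = np.vstack([good_center, bad_center])
predicted = knn_predict(train_emb, train_labels, query_emb, k=5)
print(predicted)
```

In a setup like this, only the small labeled set is needed at classification time, which is consistent with the abstract's description of training and validating on a small set of labeled exposures.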

Paper Structure

This paper contains 29 sections, 5 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: A set of representative bad exposures in each category. These examples highlight why some common issues, such as PSF and NObjects, can be hard for visual inspectors to detect. Note that some of the exposures have been scaled to highlight the relevant feature. Such scalings are frequently applied and might affect human experts' judgment, whereas the pipeline we describe in this paper uses raw exposures without any additional processing. It is worth noting that the saturated exposure displays a pattern similar to a flat-fielding issue. This is because saturated exposures are mostly taken near twilight, when the sky-background flux level is very high. Therefore, the circular pattern that is visible in a saturated image typically corresponds to the instrument response after the flat-field correction.
  • Figure 2: Continuation of Figure 1. This figure shows example exposures in the other 5 categories.
  • Figure 3: A comparison between the original sample of images and the balanced dataset in each category. "Density" denotes the normalized count of images.
  • Figure 4: A depiction of how our pipeline identifies bad exposures.
  • Figure 5: The clustering of the embeddings generated by the ViT model for the training dataset. The embeddings are processed through the same data processors described in the training subsection, and their dimensionality is further reduced using the t-SNE method to help with visualization. The grey dots represent the full training dataset, the blue dots depict "good" exposures, and the orange dots highlight the labeled exposures for each "bad" category. The top-left panel displays the good exposures, and the other panels show the bad exposures. The axes of the figures correspond to the two dimensions of the t-SNE-reduced embeddings and are not physically interpretable.
  • ...and 7 more figures