Co-SOM: Co-training for photometric redshift estimation using Self-Organizing Maps

Alvaro Callejas-Tavera, Erik Molino-Minero-Re, Octavio Valenzuela

TL;DR

Co‑SOM tackles the limited availability of spectroscopic redshifts for photometric redshift estimation in large surveys by combining co‑training with Self‑Organizing Maps to exploit unlabeled photometry. It introduces a topology‑aware selection mechanism and aggregation scheme across multiple SOMs and regions, enabling robust pseudo‑label propagation. On SDSS‑DR18, using ~1% labeled data, the method achieves a negligible bias $\Delta z$ and competitive dispersion $\sigma_{zp}$, approaching LSST photo‑z targets, with further gains observed as more labeled data are used. The work highlights the potential of semi‑supervised SOM‑based approaches for scalable, accurate redshift estimation and sketches extensions to high redshift, full redshift distributions, and survey‑specific systematics.

Abstract

Upcoming large-scale galaxy surveys, such as the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST), will generate photometry for billions of galaxies. Interpreting large-scale weak lensing maps and estimating galaxy clustering both require reliable, high-precision redshifts derived from multi-band photometry. However, obtaining spectroscopy for billions of galaxies is impractical; assembling enough galaxies with spectroscopic observations to train supervised algorithms for accurate redshift estimation is therefore a significant challenge and an open research area. We propose a novel methodology called Co-SOM, based on Co-training and Self-Organizing Maps (SOM). It integrates labeled data (sources with spectroscopic redshifts) and unlabeled data (sources with photometric observations only) during training, through a selection method based on map topology (the connectivity structure of the SOM lattice), to leverage the limited spectroscopy available for photo-z estimation. We used the magnitudes and colors of Sloan Digital Sky Survey data release 18 (SDSS-DR18) to analyze and evaluate the performance, varying the proportion of labeled data and adjusting the training parameters. For training sets with 1% labeled data ($\approx 20{,}000$ galaxies) we achieved a bias of $\Delta z = 0.00007 \pm 0.00022$, a precision of $\sigma_{zp} = 0.00063 \pm 0.00032$, and an outlier fraction of $f_{\mathrm{out}} = 0.02083 \pm 0.00027$. Additionally, we conducted experiments varying the volume of labeled data, and the bias remains below $10^{-3}$ regardless of the size of the spectroscopic or photometric samples. These low-redshift results demonstrate the potential of semi-supervised learning to address spectroscopic limitations in future photometric surveys.
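
The abstract reports three summary statistics (bias, precision, outlier fraction) without stating their formulas. As a rough illustration only, the sketch below computes one common set of photo-z conventions: the normalized residual $(z_{\mathrm{phot}} - z_{\mathrm{spec}})/(1 + z_{\mathrm{spec}})$, its mean as the bias, its standard deviation as the dispersion, and the fraction of residuals above a fixed cut as the outlier rate. These definitions, the function name `photoz_metrics`, and the cut value `0.15` are assumptions made for this example and may differ from the definitions used in the paper.

```python
import numpy as np

def photoz_metrics(z_phot, z_spec, outlier_cut=0.15):
    """Return (bias, sigma, f_out) under assumed, conventional definitions.

    These are NOT confirmed to be the paper's definitions; they are a
    commonly used convention chosen for illustration.
    """
    z_phot = np.asarray(z_phot, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    # Normalized residual per galaxy (assumed definition).
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    bias = float(np.mean(dz))                         # mean residual
    sigma = float(np.std(dz))                         # scatter of residuals
    f_out = float(np.mean(np.abs(dz) > outlier_cut))  # fraction beyond the cut
    return bias, sigma, f_out

# Toy data: three perfect estimates and one catastrophic failure.
z_spec = np.array([0.10, 0.20, 0.30, 0.40])
z_phot = np.array([0.10, 0.20, 0.30, 0.96])
bias, sigma, f_out = photoz_metrics(z_phot, z_spec)
```

With this toy input, a single large miss dominates all three statistics, which is why robust variants (median, normalized median absolute deviation) are often reported alongside or instead of the mean and standard deviation.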

Paper Structure

This paper contains 18 sections, 18 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: The input layer consists of data inputs (magnitudes and colors), each represented by an $n$-dimensional feature vector. During training, the output layer is functionally equivalent to the processing layer, as both share the same topological structure of the self-organizing map. Each neuron in this layer has an $n$-dimensional weight vector, matching the input data dimensionality, and is assigned a fixed coordinate ($i,j$) on a two-dimensional grid. This topological structure enables the SOM to preserve the spatial relationships inherent in the input space.
  • Figure 2: An overview of the co-training process is as follows: both classifiers (C1 and C2) are trained using half of the labeled data set each. Subsequently, in the following iterations, both models are updated with pseudo-labeled data generated through the selection model from the unlabeled instances.
  • Figure 3: Outliers identified using the Local Outlier Factor (LOF) algorithm, with k = 10 neighbors.
  • Figure 4: The methodology workflow comprises three primary stages: stratified sampling, training regions, and the selection method of co-training. These stages are carefully designed to address the essential components of the methodology.
  • Figure 5: Stratified sampling creates sub-samples while preserving the proportion of the original population.
  • ...and 7 more figures
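
The SOM structure described in the Figure 1 caption (neurons on a two-dimensional grid, each holding an $n$-dimensional weight vector) can be sketched as a generic online SOM step. This is a minimal textbook-style illustration, not the Co-SOM implementation: the function names `best_matching_unit` and `som_update`, the Gaussian neighborhood, the grid size, and the values of `lr` and `sigma` are all assumptions chosen for the example.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Return the (i, j) grid coordinate of the neuron whose weights are closest to x."""
    dists = np.linalg.norm(weights - x, axis=-1)  # shape (rows, cols)
    return np.unravel_index(np.argmin(dists), dists.shape)

def som_update(weights, x, lr, sigma):
    """One online SOM step (generic sketch, assumed Gaussian neighborhood).

    Every neuron moves toward x by a fraction lr * h, where h decays with
    the neuron's grid distance from the best matching unit (BMU).
    """
    rows, cols, _ = weights.shape
    bi, bj = best_matching_unit(weights, x)
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    grid_dist_sq = (ii - bi) ** 2 + (jj - bj) ** 2
    h = np.exp(-grid_dist_sq / (2.0 * sigma ** 2))  # equals 1 at the BMU
    return weights + lr * h[..., None] * (x - weights)

# Hypothetical setup: a 6x6 map over 4 input features (e.g. magnitudes/colors).
rng = np.random.default_rng(0)
weights = rng.random((6, 6, 4))
x = rng.random(4)

bmu = best_matching_unit(weights, x)
updated = som_update(weights, x, lr=0.5, sigma=1.0)
```

Because the update strength depends on distance in the grid rather than in feature space, neighboring neurons end up with similar weight vectors, which is the topology-preserving property the caption refers to.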