Co-SOM: Co-training for photometric redshift estimation using Self-Organizing Maps
Alvaro Callejas-Tavera, Erik Molino-Minero-Re, Octavio Valenzuela
TL;DR
Co-SOM tackles the limited availability of spectroscopic redshifts for photometric redshift estimation in large surveys by combining co-training with Self-Organizing Maps to exploit unlabeled photometry. It introduces a topology-aware selection mechanism and an aggregation scheme across multiple SOMs and regions, enabling robust pseudo-label propagation. On SDSS-DR18, using ~1% labeled data, the method achieves a bias of $\Delta z = 0.00007 \pm 0.00022$ and a dispersion of $\sigma_{zp} = 0.00063 \pm 0.00032$, approaching LSST photo-z targets, with further gains observed as more labeled data are used. The work highlights the potential of semi-supervised SOM-based approaches for scalable, accurate redshift estimation and sketches extensions to high redshift, full redshift distributions, and survey-specific systematics.
Abstract
Upcoming large-scale galaxy surveys, such as the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST), will generate photometry for billions of galaxies. Interpreting large-scale weak lensing maps and estimating galaxy clustering both require reliable, high-precision redshifts derived from multi-band photometry. However, obtaining spectroscopy for billions of galaxies is impractical, so assembling a spectroscopic sample large enough to train supervised algorithms for accurate redshift estimation remains a significant challenge and an open research area. We propose a novel methodology called Co-SOM, based on co-training and Self-Organizing Maps (SOMs), which integrates labeled data (sources with spectroscopic redshifts) and unlabeled data (sources with photometric observations only) during training. A selection method based on map topology (the connectivity structure of the SOM lattice) leverages the limited spectroscopy available for photo-z estimation. We used magnitudes and colors from the Sloan Digital Sky Survey Data Release 18 (SDSS-DR18) to evaluate performance, varying the proportion of labeled data and adjusting the training parameters. For training sets with 1% labeled data ($\approx 20{,}000$ galaxies), we achieved a bias of $\Delta z = 0.00007 \pm 0.00022$, a precision of $\sigma_{zp} = 0.00063 \pm 0.00032$, and an outlier fraction of $f_{\mathrm{out}} = 0.02083 \pm 0.00027$. In additional experiments varying the volume of labeled data, the bias remained below $10^{-3}$ regardless of the size of the spectroscopic or photometric samples. These low-redshift results demonstrate the potential of semi-supervised learning to address spectroscopic limitations in future photometric surveys.
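The three figures of merit quoted above (bias, precision, and outlier fraction) can be illustrated with a minimal sketch. The abstract does not give their exact definitions, so everything below is an assumption: the residual is normalized as $(z_{\mathrm{phot}} - z_{\mathrm{spec}})/(1 + z_{\mathrm{spec}})$, the bias is its mean, the precision is its standard deviation, and outliers are sources whose absolute normalized residual exceeds a threshold (0.15 here, a common choice in the photo-z literature). The function name `photoz_metrics` and the parameter `outlier_cut` are hypothetical and are not taken from the Co-SOM work.

```python
import numpy as np


def photoz_metrics(z_spec, z_phot, outlier_cut=0.15):
    """Illustrative photo-z summary statistics (assumed definitions).

    z_spec      : spectroscopic ("true") redshifts
    z_phot      : photometric redshift estimates
    outlier_cut : assumed threshold on |dz| for flagging outliers
    """
    z_spec = np.asarray(z_spec, dtype=float)
    z_phot = np.asarray(z_phot, dtype=float)

    # Normalized residual; this convention is an assumption, not the paper's stated one.
    dz = (z_phot - z_spec) / (1.0 + z_spec)

    bias = float(np.mean(dz))                          # assumed definition of the bias
    sigma = float(np.std(dz))                          # assumed definition of the precision
    f_out = float(np.mean(np.abs(dz) > outlier_cut))   # fraction of flagged sources

    return {"bias": bias, "sigma": sigma, "f_out": f_out}


if __name__ == "__main__":
    # Toy values for illustration only; these are not SDSS measurements.
    z_true = [0.0, 1.0, 0.5, 0.25]
    z_est = [0.0, 1.0, 0.5, 0.75]
    print(photoz_metrics(z_true, z_est))
```

Other conventions are also in use, for example a robust scatter based on the median absolute deviation or an outlier cut defined as a multiple of the scatter, so the values quoted in the abstract should be compared only against metrics computed with the definitions given in the full paper.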
