Optimizing Photometric Redshift Training Sets I: Efficient Compression of the Galaxy Color-Redshift Relation with UMAP
Finian Ashmead, Jeffrey A. Newman, Brett H. Andrews, Rachel Bezanson, Biprateep Dey, Daniel C. Masters, S. A. Stanford
TL;DR
The paper addresses a core problem for photo-$z$ training: spectroscopic samples are biased and sparse relative to the full photometric galaxy population, which degrades grid-based color–redshift mappings. It compares self-organizing maps (SOMs) with uniform manifold approximation and projection (UMAP) for compressing seven-color COSMOS2020 photometry into a low-dimensional embedding, and shows that UMAP yields a continuous manifold along which redshift and specific star formation rate (sSFR) vary monotonically. Redshift estimates taken as the median over nearest neighbors in UMAP space (UMAP-$k$NN-$z$) show smaller scatter and fewer outliers than SOM-based estimates (SOM-$z$), and they hold up when the training data are spectroscopically biased, whereas SOM-$z$ degrades. The results indicate that representative training sets can be constructed by interpolating between spectroscopic sources along the UMAP manifold, offering a practical path to mitigating spectroscopic biases in photo-$z$ training data for large surveys.
Abstract
Spectroscopic datasets are essential for training and calibrating photometric redshift (photo-$z$) methods. However, spectroscopic redshifts (spec-$z$'s) constitute a biased and sparse sampling of the photometric galaxy population, which creates difficulties for the common grid-based approach of mapping color to redshift using self-organizing maps (SOMs). Instead, we utilized the uniform manifold approximation and projection (UMAP) algorithm to compress a Rubin-Roman-like $ugrizyJH$ color space into a thin and densely sampled manifold. Crucially, the manifold varies continuously and monotonically in redshift and specific star formation rate in roughly orthogonal directions. Using $\sim$110,000 COSMOS2020 many-band photo-$z$'s and $\sim$15,000 spec-$z$'s as representative and non-representative samples, respectively, we trained and tested redshift estimation from a SOM (SOM-$z$) and from nearest neighbors in UMAP space (UMAP-$k$NN-$z$). Compared to SOM-$z$, UMAP-$k$NN-$z$ exhibited smaller photo-$z$ scatter and a smaller outlier fraction for the representative training set. When training with the highly biased spec-$z$ sample, UMAP-$k$NN-$z$ maintained similar performance, but the outlier fraction for SOM-$z$ increased nearly threefold. The physically meaningful trends across the UMAP manifold allow for accurate redshift regression even in regions of color space sparsely populated by spectroscopic objects, which comprise nearly 25% of the photometric sample. This suggests that representative, spectroscopically anchored training sets can be produced by interpolating between spectroscopic sources at the UMAP coordinates of photometric objects, maximizing the performance of photo-$z$ algorithms.
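
To make the UMAP-$k$NN-$z$ idea concrete, the following is a minimal sketch of median-of-nearest-neighbors regression in a two-dimensional embedding. It is an illustration only, not the authors' implementation: the function name `knn_median_redshift`, the choice of Euclidean distance, the value of `k`, and the toy coordinates are all assumptions made here, and the embedding itself (which the paper obtains with UMAP) is taken as given.

```python
import math
from statistics import median


def knn_median_redshift(train_xy, train_z, query_xy, k=5):
    """Estimate a redshift as the median over the k nearest training points.

    train_xy : list of (x, y) embedding coordinates with known redshift
    train_z  : list of redshifts, same order as train_xy
    query_xy : (x, y) embedding coordinates of the object to estimate
    k        : number of neighbors (hypothetical default, not from the paper)
    """
    if len(train_xy) != len(train_z):
        raise ValueError("train_xy and train_z must have the same length")
    if not 1 <= k <= len(train_xy):
        raise ValueError("k must be between 1 and the number of training points")

    # Rank training points by Euclidean distance to the query in the embedding.
    order = sorted(
        range(len(train_xy)),
        key=lambda i: math.dist(train_xy[i], query_xy),
    )

    # The median is less sensitive than the mean to a stray outlying neighbor.
    return median(train_z[i] for i in order[:k])


# Toy embedding in which redshift increases monotonically along the x axis.
toy_xy = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
toy_z = [0.1, 0.3, 0.5, 0.7, 0.9]

estimate = knn_median_redshift(toy_xy, toy_z, (2.1, 0.0), k=3)
print(estimate)  # 0.5 for this toy data
```

Because the estimate depends only on nearby points rather than on membership in a fixed grid cell, a query that falls in a sparsely sampled part of the embedding still borrows from the closest training objects on either side, which is the property the abstract relies on when the manifold varies monotonically in redshift. For catalogs of the size quoted above, a brute-force sort per query would be slow, and a spatial index such as a k-d tree would be the natural replacement.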
