Optimizing Photometric Redshift Training Sets I: Efficient Compression of the Galaxy Color-Redshift Relation with UMAP

Finian Ashmead, Jeffrey A. Newman, Brett H. Andrews, Rachel Bezanson, Biprateep Dey, Daniel C. Masters, S. A. Stanford

TL;DR

The paper addresses the problem that spectroscopic training samples are biased and sparse relative to the full photometric galaxy population, which degrades photo-$z$ performance in grid-based color–redshift mappings. It compares self-organizing maps (SOMs) with uniform manifold approximation and projection (UMAP) for compressing seven-color ($ugrizyJH$) COSMOS2020 photometry into low-dimensional embeddings, and shows that UMAP yields a continuous manifold along which redshift and sSFR vary monotonically. Redshift estimates taken as the median redshift of nearest neighbors in UMAP space (UMAP-$k$NN-$z$) have lower scatter and fewer outliers than SOM-based estimates (SOM-$z$), especially when the training data are spectroscopically biased. The results indicate that representative training sets can be constructed by interpolating along the UMAP manifold, enabling robust photo-$z$ estimation at large survey scales and offering a practical path to mitigating spectroscopic biases in training data.

Abstract

Spectroscopic datasets are essential for training and calibrating photometric redshift (photo-$z$) methods. However, spectroscopic redshifts (spec-$z$'s) constitute a biased and sparse sampling of the photometric galaxy population, which creates difficulties for the common grid-based approach for mapping color to redshift using self-organizing maps (SOMs). Instead, we utilized the uniform manifold approximation and projection (UMAP) algorithm to compress a Rubin-Roman-like $ugrizyJH$ color space into a thin and densely-sampled manifold. Crucially, the manifold varies continuously and monotonically in redshift and specific star formation rate in roughly orthogonal directions. Using $\sim$110,000 COSMOS2020 many-band photo-$z$'s and $\sim$15,000 spec-$z$'s as representative and non-representative samples, respectively, we trained and tested redshift estimation from a SOM (SOM-$z$) and from nearest neighbors in UMAP space (UMAP-$k$NN-$z$). Compared to SOM-$z$, UMAP-$k$NN-$z$ exhibited smaller photo-$z$ scatter and fraction of outliers for the representative training set. When training with the highly biased spec-$z$ sample, UMAP-$k$NN-$z$ maintained similar performance, but the outlier fraction for SOM-$z$ increased by nearly three times. The physically-meaningful trends across the UMAP manifold allow for accurate redshift regression even in regions of color space sparsely populated by spectroscopic objects, which comprise nearly 25% of the photometric sample. This suggests that representative, spectroscopically-anchored training sets can be produced by interpolating between spectroscopic sources at the UMAP coordinates of photometric objects, maximizing the performance of photo-$z$ algorithms.
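
The nearest-neighbor step described above can be illustrated with a minimal sketch. It assumes the embedding coordinates have already been computed (for example by a UMAP implementation, not shown here) and takes the redshift estimate to be the median redshift of the $k$ nearest training objects. The function name `knn_median_redshift`, the default `k=5`, the Euclidean distance, and the brute-force search are illustrative assumptions, not choices taken from the paper.

```python
import math
import statistics


def knn_median_redshift(train_coords, train_z, query_coords, k=5):
    """Median redshift of the k nearest training points for each query point.

    Hypothetical helper: coordinates are assumed to be points in an
    already-computed low-dimensional embedding; k and the Euclidean
    metric are illustrative assumptions.
    """
    if len(train_coords) != len(train_z):
        raise ValueError("train_coords and train_z must have the same length")
    if not 1 <= k <= len(train_coords):
        raise ValueError("k must be between 1 and the number of training points")

    estimates = []
    for q in query_coords:
        # Brute-force search: rank training indices by distance to the query.
        nearest = sorted(
            range(len(train_coords)),
            key=lambda i: math.dist(q, train_coords[i]),
        )[:k]
        estimates.append(statistics.median(train_z[i] for i in nearest))
    return estimates


# Toy data: five training points along a line, redshift increasing with x.
toy_coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)]
toy_z = [0.1, 0.3, 0.5, 0.7, 0.9]
print(knn_median_redshift(toy_coords, toy_z, [(2.1, 0, 0), (0.2, 0, 0)], k=3))
# [0.5, 0.3]
```

Because the estimate depends only on neighbors in the embedding rather than on membership in a fixed grid cell, every query point receives a value even where training objects are sparse. For samples of the size quoted above, a tree-based or approximate neighbor search would likely replace the brute-force loop.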

Paper Structure

This paper contains 16 sections, 5 figures.

Figures (5)

  • Figure 1: The distributions of our photometric (orange) and spectroscopic (spec-$z$ CL $>95$; blue) samples in redshift (upper left), $g-z$ color (all redshifts; upper right), $i$-band magnitude (lower left), and $z-y$ color for objects at redshifts $1.4<z<1.5$ (lower right). Higher redshifts, particularly at $z>1.5$, are underrepresented in the spectroscopic sample, as are objects fainter than $i\sim23$. The $g-z$ color distribution is redder overall for the spectroscopic sample, while the $z-y$ color at $1.4<z<1.5$ is systematically bluer for the spectroscopic sample. In the case of the $g-z$ distribution, the apparent overabundance of spectroscopic sources at $g-z\gtrsim1.5$ reflects redder galaxies at $z<1$, while in the case of the $z-y$ distribution for $1.4<z<1.5$, the bluer colors of the spectroscopic sample reflect a preference for star-forming galaxies. Offset color distributions at fixed redshift such as that shown in the lower right will bias color--redshift mappings based on non-representative spectroscopic samples. The spectroscopic sample's biased coverage of color space can also be seen in the bottom row of Figure 2.
  • Figure 2: A self-organizing map (SOM) partitioning the $ugrizyJH$ color space, color-coded by galaxy count (labeled $N$; left column), median redshift ($z$; center column), and median LePHARE specific star formation rate (sSFR; right column). The top row shows the full sample (112,409 sources) with the middle panel color-coded by LePHARE-$z$. The middle row shows a random sample of photometric objects of the same size as the spec-$z$ sample (14,499 sources) also color-coded by LePHARE-$z$. The bottom row shows the spec-$z$ CL $>95$ sample (14,499 sources), with the spec-$z$ coloring the middle panel. The middle row represents an ideally distributed training data case; it is representative of the full sample and gaps in the coverage of the color space are small and distributed more or less evenly. In contrast, the spectroscopic data (bottom row) illustrates the biased and incomplete coverage of color--redshift space that we aim to mitigate with UMAP-based techniques.
  • Figure 3: A three-dimensional UMAP embedding of the $ugrizyJH$ color space for the full sample of 112,409 COSMOS2020 sources. The points are color-coded by LePHARE photo-$z$ (top) and sSFR (bottom). Objects lie on a nearly two-dimensional manifold in this space, exhibiting continuous and monotonic trends in redshift and sSFR in roughly orthogonal directions. This figure shows selected frames from the thirteen-second animation available at https://finianashmead.github.io/#umap-cosmos2020-video, in which the plots are rotating about the vertical axes to better illustrate the three-dimensional structure.
  • Figure 4: A three-dimensional UMAP embedding of the $ugrizyJH$ color space showing only the 14,499 sources with high-confidence spec-$z$'s. The points are color-coded by spec-$z$ (top) and LePHARE sSFR (bottom). Compared to the photometric objects (see Figure 3), the spectroscopic objects provide a sparser and more biased sampling of the manifold. However, unlike the SOM in Figure 2, the trends in redshift and sSFR are continuous across the manifold, enabling accurate interpolation between the spectroscopic objects. This figure shows selected frames from the thirteen-second animation available at https://finianashmead.github.io/#umap-cosmos2020-video, in which the plots are rotating about the vertical axes to better illustrate the three-dimensional structure.
  • Figure 5: Photo-$z$ point estimate metrics $f_\mathrm{outlier}$ (top), $\sigma_\mathrm{NMAD}$ (middle), and bias (bottom) calculated in $\Delta z=0.2$ redshift bins for both SOM-$z$ (gray) and UMAP-$k$NN-$z$ (blue), when training with a random sample of COSMOS2020 LePHARE photo-$z$'s (left) and the spec-$z$ CL $>95$ sample of spec-$z$'s (right). For SOM, we compute these metrics (i) only for objects in cells with training data (light gray; i.e. ignoring the regions of color space not represented in the training data) and (ii) assuming that objects in the SOM cells lacking training redshifts are outliers (dark gray; 6,870 sources or 7% of the test sample for LePHARE-trained and 22,958 sources or 23% of the test sample for spec-$z$-trained). For UMAP, all metrics are computed for the full test samples of 97,910 objects in both training cases. UMAP-$k$NN-$z$ yields a lower $f_\mathrm{outlier}$ at all redshifts, especially for the more realistic spec-$z$-trained case. UMAP-$k$NN-$z$ also exhibits lower $\sigma_\mathrm{NMAD}$ and bias across much of the redshift range, quite dramatically in the spec-$z$-trained case at $1.8<z<2.0$.
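
The binned metrics named in the Figure 5 caption can be sketched as follows. The captions do not give the definitions, so this example assumes conventions that are common in the photo-$z$ literature: $\Delta z = (z_\mathrm{est}-z_\mathrm{ref})/(1+z_\mathrm{ref})$, bias as the median of $\Delta z$, $\sigma_\mathrm{NMAD} = 1.4826\times\mathrm{median}(|\Delta z-\mathrm{median}(\Delta z)|)$, and outliers as objects with $|\Delta z|>0.15$. The outlier threshold, the choice to bin by reference redshift, and the names `point_metrics` and `binned_metrics` are all assumptions made for illustration; only the $\Delta z=0.2$ bin width comes from the caption.

```python
import math
import statistics
from collections import defaultdict


def point_metrics(z_est, z_ref, outlier_cut=0.15):
    """Outlier fraction, sigma_NMAD, and bias for one set of objects.

    Assumed conventions (not taken from the source): normalized residual
    dz = (z_est - z_ref) / (1 + z_ref), bias = median(dz), and an outlier
    threshold of |dz| > 0.15.
    """
    dz = [(e - r) / (1.0 + r) for e, r in zip(z_est, z_ref)]
    bias = statistics.median(dz)
    sigma_nmad = 1.4826 * statistics.median(abs(d - bias) for d in dz)
    f_outlier = sum(abs(d) > outlier_cut for d in dz) / len(dz)
    return {"f_outlier": f_outlier, "sigma_nmad": sigma_nmad, "bias": bias}


def binned_metrics(z_est, z_ref, bin_width=0.2, outlier_cut=0.15):
    """Compute the metrics in reference-redshift bins of width bin_width.

    Returns a dict keyed by the lower edge of each populated bin. Values
    lying exactly on a bin edge are subject to floating-point rounding.
    """
    bins = defaultdict(lambda: ([], []))
    for e, r in zip(z_est, z_ref):
        est_list, ref_list = bins[math.floor(r / bin_width)]
        est_list.append(e)
        ref_list.append(r)
    return {
        round(idx * bin_width, 10): point_metrics(*bins[idx], outlier_cut)
        for idx in sorted(bins)
    }


# Toy usage: two objects per bin, one catastrophic outlier in the first bin.
for lower_edge, metrics in binned_metrics(
    [0.1, 0.65, 0.3, 0.3], [0.1, 0.1, 0.3, 0.3]
).items():
    print(lower_edge, metrics)
```

A SOM-based comparison like the dark-gray case in the caption would additionally need a rule for objects in cells without training redshifts; counting them as outliers amounts to adding them to the numerator of `f_outlier` while excluding them from the scatter and bias calculations.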