
A review of unsupervised learning in astronomy

Sotiria Fotopoulou

TL;DR

This review synthesises how unsupervised learning has evolved in astronomy, highlighting core methods such as PCA/SVD, ICA, NMF, Isomap, LLE, t-SNE, and UMAP, as well as clustering techniques like k‑means, GMMs, and DBSCAN/HDBSCAN. It emphasizes the shift from purely linear, dimensionality‑reduction approaches to nonlinear manifolds, neural network–based representations, and modern self‑supervised and domain‑adaptation strategies. The authors stress practical workflow considerations, data peculiarities (missing data, heterogeneity), and the importance of robust validation, benchmarks, and interpretability in data‑driven discovery. Overall, the paper argues for thoughtful integration of ML with domain knowledge to enable scalable, generalizable insights while avoiding overinterpretation in the face of complex, high‑dimensional astronomical data.
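As a concrete illustration of the kind of pipeline the review surveys (linear dimensionality reduction chained with density-based clustering), here is a minimal sketch on synthetic data. The use of scikit-learn, the toy catalogue, and the choice of DBSCAN in place of HDBSCAN are assumptions for illustration, not details taken from the paper:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Toy "catalogue": two populations of objects in a 10-dimensional feature
# space (e.g. photometric colours), each scattered around its own centre.
a = rng.normal(loc=0.0, scale=0.3, size=(100, 10))
b = rng.normal(loc=3.0, scale=0.3, size=(100, 10))
X = np.vstack([a, b])

# Standardise, reduce to 2 principal components, then cluster by density.
X_std = StandardScaler().fit_transform(X)
Z = PCA(n_components=2).fit_transform(X_std)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(Z)
# Two dense groups should be recovered; a label of -1 would mark noise.
```

No cluster count is specified in advance: DBSCAN discovers the number of dense groups from the data, which is the property that makes its hierarchical variant HDBSCAN attractive for exploratory astronomical catalogues.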

Abstract

This review summarizes popular unsupervised learning methods and gives an overview of their past, current, and future uses in astronomy. Unsupervised learning aims to organise the information content of a dataset in such a way that knowledge can be extracted. Traditionally this has been achieved through dimensionality reduction techniques that aid the ranking of a dataset, for example through principal component analysis or by using auto-encoders, or through simpler visualisation of a high-dimensional space, for example with a self-organising map. Other desirable properties of unsupervised learning include the identification of clusters, i.e. groups of similar objects, which has traditionally been achieved by the k-means algorithm and more recently through density-based clustering such as HDBSCAN. More recently, complex frameworks have emerged that chain together dimensionality reduction and clustering methods. However, no dataset is fully unknown. Thus, nowadays a lot of research has been directed towards self-supervised and semi-supervised methods that stand to gain from both supervised and unsupervised learning.
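The k-means algorithm mentioned in the abstract partitions a dataset into a preset number of groups by repeatedly assigning points to the nearest centroid and recomputing the centroids. A minimal sketch on synthetic data (scikit-learn and the toy feature vectors are illustrative assumptions, not material from the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic 4-feature "objects" drawn around three centres at 0, 2, and 4.
X = np.vstack([rng.normal(c, 0.2, size=(50, 4)) for c in (0.0, 2.0, 4.0)])

# Unlike density-based methods, k-means needs the cluster count up front.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centres = np.sort(km.cluster_centers_[:, 0])
print(centres)  # recovered centres should lie near 0, 2, and 4
```

The need to fix `n_clusters` in advance is exactly the limitation that motivates the density-based alternatives (DBSCAN/HDBSCAN) the review discusses, which instead infer the number of groups from the data.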

Paper Structure

This paper contains 54 sections, 5 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Graphical overview of unsupervised learning in astronomy.
  • Figure 2: Dimensionality reduction with (a) Isomap, which finds geodesic distances between neighbouring points, and (b) Locally-Linear Embedding, which finds neighbourhoods where a linear approximation holds. Both methods are showcased on the 'Swiss roll' dataset, where the curvature of the manifold can cause a 'short-circuit' if too large a neighbourhood is chosen. Fig. (a) adapted from Tenenbaum2000-ISOMAP; fig. (b) adapted from Roweis2000-locally-linear-embedding.
  • Figure 3: UMAP and t-SNE projections of a 10% subsample (red) and the full (blue) flow cytometry dataset. It is visually clear that projecting new data onto a clustered space based on either of these projections leads to gross misalignment, here quantified using Procrustes-based alignment. Figure adapted from McInnes2018_UMAP.
  • Figure 4: Autoencoder network, figure adapted from kramer1991_ae.
  • Figure 5: Complex frameworks are emerging in modern ML applications. Here we show one example of modelling the autoencoder latent space with a self-organising map, subsequently clustered with k-means. Figure adapted from Ralph2019PASP..131j8011R.
  • ...and 1 more figure
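The Isomap/LLE comparison in Figure 2 can be reproduced in outline with scikit-learn's built-in Swiss roll generator. The neighbourhood size of 10 below is an illustrative choice, not a value taken from the paper; too large a neighbourhood would produce the 'short-circuit' across the roll that the caption warns about:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# 'Swiss roll' dataset as in Figure 2: a 2-D sheet rolled up in 3-D.
X, t = make_swiss_roll(n_samples=1000, random_state=0)

# Small neighbourhoods keep distances on-manifold when unrolling to 2-D.
iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2,
                             random_state=0).fit_transform(X)
# Both embeddings map the 3-D roll onto a flat 2-D sheet.
```

Isomap preserves global geodesic distances along the manifold, while LLE only preserves the local linear structure of each neighbourhood; this global-versus-local trade-off is the contrast the figure illustrates.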