An updated efficient galaxy morphology classification model based on ConvNeXt encoding with UMAP dimensionality reduction

Guanwen Fang, Shiwei Zhu, Jun Xu, Shiying Lu, Chichun Zhou, Yao Dai, Zesen Lin, Xu Kong

TL;DR

This work addresses the scalable, unsupervised classification of galaxy morphologies in large surveys by updating the USmorph framework with a pre-trained ConvNeXt encoder and UMAP dimensionality reduction. The dual-stage approach yields 20 algorithmic clusters that are visually refined into five physical morphologies, classifying $50{,}056$ galaxies (about $51\%$ of the COSMOS sample) with significantly reduced computational cost. Validation against an external catalog (Galaxy Zoo: Hubble) and extensive structural parameter analysis demonstrate that the method captures expected morphology–structure correlations and offers robust, transferable classifications suitable for future surveys like CSST. The framework reduces reliance on labeled data, improves efficiency for cross-survey analyses, and provides a high-quality training subset for supervised or semi-supervised extensions.

Abstract

We present an enhanced unsupervised machine learning (UML) module within our previous USmorph classification framework featuring two components: (1) hierarchical feature extraction via a pre-trained ConvNeXt convolutional neural network (CNN) with transfer learning, and (2) nonlinear manifold learning using Uniform Manifold Approximation and Projection (UMAP) for topology-aware dimensionality reduction. This dual-stage design enables efficient knowledge transfer from large-scale visual datasets while preserving morphological pattern geometry through UMAP's neighborhood preservation. We apply the upgraded UML to I-band images of 99,806 COSMOS galaxies at redshift $0.2<z<1.2$ (to ensure rest-frame optical morphology) with $I_{\mathrm{mag}}<25$. The predefined cluster number is optimized to 20 (reduced from 50 in the original framework), achieving significant computational savings. The 20 algorithmically identified clusters are merged into five physical morphology types. About 51% of galaxies (50,056) were successfully classified. To assess classification effectiveness, we tested morphological parameters for massive galaxies with $M_{*}>10^{9}~M_{\odot}$. Our classification results align well with galaxy evolution theory. This improved algorithm significantly enhances galaxy morphology classification efficiency, making it suitable for large-scale sky surveys such as those planned with the China Space Station Telescope (CSST).
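
The sample cuts quoted in the abstract can be written as two simple filters. The following is a minimal sketch, not the authors' code: the `Galaxy` record, its field names, and the toy catalog values are all hypothetical, and real COSMOS catalog columns will be named differently.

```python
from dataclasses import dataclass


@dataclass
class Galaxy:
    # Hypothetical catalog fields, chosen for illustration only.
    gal_id: int
    z: float          # redshift
    i_mag: float      # I-band magnitude
    log_mstar: float  # log10 of stellar mass in solar masses


def in_parent_sample(g: Galaxy) -> bool:
    """Cuts quoted in the abstract: 0.2 < z < 1.2 and I_mag < 25."""
    return 0.2 < g.z < 1.2 and g.i_mag < 25.0


def is_massive(g: Galaxy) -> bool:
    """Validation subsample quoted in the abstract: M* > 10^9 M_sun."""
    return g.log_mstar > 9.0


# Toy catalog with made-up values.
catalog = [
    Galaxy(1, 0.45, 22.1, 10.3),
    Galaxy(2, 1.50, 23.0, 10.8),  # fails the redshift cut
    Galaxy(3, 0.80, 25.6, 9.4),   # fails the magnitude cut
    Galaxy(4, 0.30, 24.2, 8.7),   # in the parent sample, but not massive
]

parent = [g for g in catalog if in_parent_sample(g)]
massive = [g for g in parent if is_massive(g)]

print([g.gal_id for g in parent])
print([g.gal_id for g in massive])
```

Keeping the mass cut separate from the parent-sample cut mirrors the abstract, where the mass threshold is applied only when testing morphological parameters.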

Paper Structure

This paper contains 16 sections, 5 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Distribution of galaxies in the COSMOS field in the I-band magnitude-redshift plane. The corresponding number distributions along $I_{\rm mag}$ and redshift are displayed at the top and right corners, respectively. The sample of galaxies at $0.2<z<1.2$ with $I_{\rm mag}<25$ is shown in orange.
  • Figure 2: Six examples of the image pre-processing flow, corresponding to each set of images from left to right: original image, denoised image, and image after the polar coordinate transformation.
  • Figure 3: Framework of ConvNeXt model. The architecture of ConvNeXt employs a Transformer-inspired modular design, which hierarchically stacks standardized modules in a multi-level architecture. Inter-module communication and gradient propagation are facilitated via consistent architectural interfaces, ensuring efficient information flow and stable training dynamics. This design paradigm maintains structural simplicity while enhancing scalability, enabling systematic network expansion via modular composition.
  • Figure 4: Schematic diagram of the UML clustering process. Step (a) extracts key features from the image data using ConvNeXt; step (b) reduces dimensionality and removes redundant image information via UMAP; step (c) gives a detailed example of the feature extraction and dimensionality reduction process; and step (d) applies a voting mechanism based on a Bagging clustering model.
  • Figure 5: Continuation of the schematic diagram of the UML clustering process from Figure 4. Step (e) shows the visual classification: 100 images are randomly selected from the 20 machine-learning clusters and visually classified into five galaxy types, namely SPH (spherical), ETD (early-type disk), LTD (late-type disk), IRR (irregular), and UNC (unclassified). Step (f) shows the UMAP visualization of the clustering, based on the UMAP 2D projection of the five-class labels derived from the previous step. Left: the 2048-dimensional features of all samples extracted by the ConvNeXt model, visualized with UMAP. Right: the distribution of the 300-dimensional features after UMAP-based dimensionality reduction.
  • ...and 9 more figures
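
Steps (d) and (e) in the captions of Figures 4 and 5 can be illustrated with a small sketch. It rests on assumptions not stated in this summary: the voting rule is taken to be unanimous agreement among several clustering models, cluster IDs are assumed to be already aligned across models, and the `CLUSTER_TO_TYPE` table is a hypothetical five-cluster toy rather than the paper's visually derived mapping of 20 clusters.

```python
from collections import Counter

# Hypothetical cluster-to-type table. In the paper this mapping comes from
# visual inspection of the 20 machine-learning clusters, not from a fixed table.
CLUSTER_TO_TYPE = {0: "SPH", 1: "ETD", 2: "LTD", 3: "IRR", 4: "UNC"}


def vote(labels_per_model):
    """Return one cluster ID per galaxy, or None where the models disagree.

    labels_per_model holds one list of cluster IDs per clustering model.
    Assumption: IDs are already aligned, so ID k means the same cluster
    in every model.
    """
    voted = []
    for labels in zip(*labels_per_model):
        # Keep a galaxy only if every model assigns it the same cluster.
        voted.append(labels[0] if len(set(labels)) == 1 else None)
    return voted


def merge_to_types(voted):
    """Map agreed cluster IDs to morphology types; None stays None."""
    return [CLUSTER_TO_TYPE[c] if c is not None else None for c in voted]


# Toy labels from three hypothetical clustering models for six galaxies.
model_a = [0, 1, 2, 3, 4, 2]
model_b = [0, 1, 2, 0, 4, 2]
model_c = [0, 1, 1, 3, 4, 2]

voted = vote([model_a, model_b, model_c])
types = merge_to_types(voted)
counts = Counter(t for t in types if t is not None)

print(types)
print(counts)
```

Under a unanimity rule like this, galaxies on which the models disagree are left without a label, which is one way a pipeline can end up classifying only part of its input sample.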