Sequencing Silicates in the IRS Debris Disk Catalog I: Methodology for Unsupervised Clustering

Cicero X. Lu; Tushar Mittal; Christine H. Chen; Alexis Y. Li; Kadin Worthen; B. A. Sargent; Carey M. Lisse; G. C. Sloan; Dean C. Hines; Dan M. Watson; Isabel Rebollido; Bin B. Ren; Joel D. Green

TL;DR

The paper presents CLUES, a non-parametric, fully interpretable unsupervised framework for clustering Spitzer IRS debris-disk spectra to reveal mineralogical groupings. It combines a preprocessing pipeline (photosphere subtraction, emissivity normalization, continuum modeling, binning) with the distance-based Sequencer workflow. The resulting distance matrices, computed with the Earth Mover's Distance at a chosen distance scale, feed both a minimum spanning tree (MST) and hierarchical clustering. The approach is demonstrated on a forsterite spectral library, a meteorite ensemble, and a debris-disk spectrum, laying groundwork for broader mineralogical demographics and follow-up studies with JWST and other observatories. By enabling objective, scalable extraction of end-member spectra, CLUES aims to advance our understanding of debris-disk composition and its links to planet formation, with potential applicability to protoplanetary disks and remote-sensing spectroscopy.

Abstract

Debris disks, which consist of dust, planetesimals, planets, and gas, offer a unique window into the mineralogical composition of their parent bodies, especially during the critical phase of terrestrial planet formation spanning 10 to a few hundred million years. Observations from the $\textit{Spitzer}$ Space Telescope have unveiled thousands of debris disks, yet systematic studies remain scarce, let alone those with unsupervised clustering techniques. This study introduces $\texttt{CLUES}$ (CLustering UnsupErvised with Sequencer), a novel, non-parametric, fully-interpretable machine-learning spectral analysis tool designed to analyze and classify the spectral data of debris disks. $\texttt{CLUES}$ combines multiple unsupervised clustering methods with multi-scale distance measures to discern new groupings and trends, offering insights into compositional diversity and geophysical processes within these disks. Our analysis allows us to explore a vast parameter space in debris disk mineralogy and also offers broader applications in fields such as protoplanetary disks and solar system objects. This paper details the methodology, implementation, and initial results of $\texttt{CLUES}$, setting the stage for more detailed follow-up studies focusing on debris disk mineralogy and demographics.

Paper Structure (12 sections, 3 equations, 6 figures)

Introduction
Spectral analysis workflow
Preprocessing Stage 1
Stellar Photosphere Modeling and Subtraction
Determining the Average Emissivity for the Small Dust Grain Population from IRS Spectra
Disk Continuum Modeling
Binning data in Spectra to improve SNR
Spectra Normalization
CLUES Analysis Workflow
Sequencer and its Associated Workflow
Selecting a Distance Metric
Optimizing the Distance Scale

Figures (6)

Figure 1: Spectral Index Band Locations with an example Spitzer IRS spectrum. The position of the strongest $10\,\mu$m band is plotted against the emissivity in band A ($8.9$--$9.6\,\mu$m), B ($9.8$--$10.2\,\mu$m), C ($10.8$--$11.4\,\mu$m), and D ($12.2$--$12.7\,\mu$m). The x-axis is wavelength in microns and the y-axis is emissivity, usually defined as the disk flux divided by the fitted continuum flux over $8$--$13\,\mu$m (Morlok+14).
Figure 2: A Flowchart of Data Processing Steps (P1): Each rectangular box represents a data processing step (in black fonts) and its corresponding subsections (in gray fonts) in the next section.
Figure 3: Visualization of Data Processing (Section \ref{section:preprocessing}) for an example debris disk spectrum, HD 113766. Top: The original spectrum of HD 113766. We show the stellar photosphere (red dotted line), disk continuum (green dashed line), anchor points (black points) for fitting the disk continuum, and the sum of the stellar and disk flux contributions (orange solid line), as described in Sections \ref{StellarContinuum} and \ref{sub:diskContinuum}. Middle: The spectrum (blue) after subtracting the stellar photosphere (red dotted line) and the debris disk continuum (green dashed line). The red points mark where data values become negative after disk continuum subtraction, as described in Section \ref{StellarContinuum}. We truncate the spectrum at wavelengths beyond 33 $\mu$m, where photosphere and disk-continuum subtraction often yield negative values because of noise in the Spitzer/IRS data. Bottom: The emissivity spectrum (blue), as described in Section \ref{sub:emissivity}.
Figure 4: A Flowchart of Data Analyses Steps - CLUES: Each rectangular box represents a data processing step (in black fonts) and its corresponding subsections (in gray fonts). The top rectangular box displays our input forsterite library emissivity spectra. The subsequent step involves calculating the distance matrices using Sequencer. This distance matrix enables us to perform two separate analyses, as indicated by the bifurcated arrows in the next step. We have the option to calculate a minimum spanning tree by collapsing the distance matrix into a 1D sequence. Alternatively, hierarchical clustering algorithms can be utilized to classify compositionally-representative spectra. The next box connected to the "hierarchical clustering" rectangular box corresponds to the silhouette score analysis, which serves as a clustering criterion. By applying this criterion, we can generate our final outputs, groupings of spectra for further parametric modeling. Finally, various tools are employed to visualize the distance matrices for our dataset, facilitating the understanding of the correlation between any science target spectrum and external mineral library spectra.
Figure 5: Forsterite Emissivity Library. Left: Forsterite emissivity plotted as a function of Fo number, from the Jena Database (Chihara+02). Right: Forsterite emissivity with $5\%$ random Gaussian noise.
...and 1 more figures
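
The workflow described in the Figure 4 caption (distance matrix, then either a minimum spanning tree or hierarchical clustering scored by silhouette) can be illustrated with a minimal sketch. This is not the CLUES or Sequencer implementation: it assumes SciPy's `wasserstein_distance` as a single-scale stand-in for the Earth Mover's Distance, uses toy Gaussian bands in place of the forsterite library, and every name in it (`emd_distance_matrix`, `mean_silhouette`, the band centers and widths) is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import squareform
from scipy.stats import wasserstein_distance


def emd_distance_matrix(wavelengths, spectra):
    """Pairwise Earth Mover's Distance, treating each spectrum as a
    non-negative weight distribution over the shared wavelength grid."""
    n = len(spectra)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = wasserstein_distance(
                wavelengths, wavelengths,
                u_weights=spectra[i], v_weights=spectra[j],
            )
    return dist


def mean_silhouette(dist, labels):
    """Mean silhouette score computed from a precomputed distance matrix."""
    clusters = np.unique(labels)
    if clusters.size < 2:
        return 0.0  # silhouette is undefined for a single cluster
    scores = []
    for i, lab in enumerate(labels):
        same = labels == lab
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = dist[i, same].mean()  # mean intra-cluster distance
        b = min(dist[i, labels == c].mean() for c in clusters if c != lab)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Toy data (assumption): two families of Gaussian bands near 10 and 11.3 microns,
# perturbed by 5% multiplicative Gaussian noise.
rng = np.random.default_rng(0)
wavelengths = np.linspace(8.0, 13.0, 200)
centers = np.array([9.8, 9.9, 10.0, 10.1, 11.2, 11.3, 11.4, 11.5])
clean = np.exp(-0.5 * ((wavelengths[None, :] - centers[:, None]) / 0.3) ** 2)
spectra = clean * (1.0 + 0.05 * rng.standard_normal(clean.shape))

dist = emd_distance_matrix(wavelengths, spectra)

# Branch 1: hierarchical clustering, with the silhouette score choosing k.
tree = linkage(squareform(dist, checks=False), method="average")
scores = {k: mean_silhouette(dist, fcluster(tree, t=k, criterion="maxclust"))
          for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
labels = fcluster(tree, t=best_k, criterion="maxclust")

# Branch 2: minimum spanning tree over the same distance matrix.
mst = minimum_spanning_tree(dist)
```

Both branches consume the same precomputed distance matrix, which is the design point the caption makes: the distance computation is done once, and the choice between a 1D ordering and discrete groupings is deferred. The average-linkage method and the range of `k` tried here are arbitrary choices for the sketch.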
