A Comparative Study of Structural Representations for 2D Materials: Insights from Dynamic Collision Fingerprint and Matminer

Raphael M. Tromer, Isaac M. Felix, Rafael Besse, Marcelo L. Pereira Junior, Marcos G. E. da Luz

Abstract

In materials science, the choice of structural descriptors for machine learning protocols strongly influences both predictive performance and the degree of physical interpretability attainable from the resulting models. Although more complex descriptors may improve numerical accuracy, they often add computational load and reduce transparency into the underlying structural information. The Dynamic Collision Fingerprint (DCF) framework was recently proposed to produce concise, physically meaningful representations by generating descriptors through dynamical probing of atomic structures. In this work, we benchmark DCF on a dataset of 120 two-dimensional carbon allotropes and compare its performance with the widely used Matminer library. The analysis employs three regression models (linear regression, decision tree, and XGBoost), evaluated over train/test partitions ranging from 10% to 90% and repeated over multiple random seeds to characterize statistical variability. The results show that DCF matches Matminer in predictive accuracy across all learning algorithms while using descriptors of significantly lower dimensionality, which points to manageable computing costs. Moreover, compared with the rather technical Matminer descriptors, DCF offers considerably clearer physical interpretability. These findings suggest that DCF is a meaningful alternative to high-dimensional descriptor libraries as a structural representation, since it is both computationally flexible and physically grounded.
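
The evaluation protocol described in the abstract (a sweep over training fractions, repeated over random seeds, scored by error metrics) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic placeholders rather than DCF or Matminer descriptors, only an ordinary least-squares model stands in for the three regressors, and the function names (`sweep`, `fit_linear`, and so on), the number of seeds, and the feature count are hypothetical.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2))
    return 1.0 - ss_res / ss_tot


def fit_linear(X, y):
    """Ordinary least squares with an intercept column appended."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


def sweep(X, y, fractions, seeds):
    """For each training fraction, repeat a random split per seed and
    aggregate test-set MAE and R^2 as mean and standard deviation."""
    n = len(y)
    results = {}
    for frac in fractions:
        n_train = int(round(frac * n))
        maes, r2s = [], []
        for seed in seeds:
            perm = np.random.default_rng(seed).permutation(n)
            train, test = perm[:n_train], perm[n_train:]
            coef = fit_linear(X[train], y[train])
            pred = predict_linear(coef, X[test])
            maes.append(mae(y[test], pred))
            r2s.append(r2(y[test], pred))
        results[frac] = {
            "mae_mean": float(np.mean(maes)),
            "mae_std": float(np.std(maes)),
            "r2_mean": float(np.mean(r2s)),
            "r2_std": float(np.std(r2s)),
        }
    return results


# Synthetic placeholder data: 120 samples (matching the dataset size only),
# 4 made-up features, and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=120)

fractions = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9
results = sweep(X, y, fractions, seeds=range(5))

for frac, stats in results.items():
    print(f"train fraction {frac:.1f}: "
          f"MAE = {stats['mae_mean']:.3f} +/- {stats['mae_std']:.3f}, "
          f"R2 = {stats['r2_mean']:.3f}")
```

To compare two descriptor sets in this style, the same `sweep` call would be run once per feature matrix with identical seeds, so that each split is shared between the two representations.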

Paper Structure (4 sections, 4 figures, 1 table)

This paper contains 4 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Schematic workflow of the computational pipeline employed in this study. After dataset preparation, structural descriptors were generated using DCF and Matminer, and the models were trained using linear regression, decision trees, and XGBoost. Performance was assessed through MAE and $R^2$, complemented by paired statistical tests and correlation analyses.
  • Figure 2: MAE as a function of $X_\text{T}$ for different DCF parameterizations defined by $N_\text{S}$ and $N_\text{L}$ for (a) linear regression, (b) decision tree, and (c) XGBoost.
  • Figure 3: MAE as a function of $X_\text{T}$ comparing DCF and Matminer descriptors for (a) linear regression, (b) decision tree, and (c) XGBoost.
  • Figure 4: $R^2$ as a function of $X_\text{T}$ comparing DCF and Matminer descriptors for (a) linear regression, (b) decision tree, and (c) XGBoost.
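
For reference, the two metrics named in the captions are written out below, assuming they follow the standard definitions (the listing above does not spell them out). The symbols are introduced here for illustration: $y_i$ is the reference value, $\hat{y}_i$ the model prediction, $\bar{y}$ the mean of the reference values, and $n$ the number of test samples.

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
               {\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

Under these definitions, a lower MAE and an $R^2$ closer to 1 both indicate better agreement with the reference values, and $R^2$ can become negative when a model performs worse than predicting $\bar{y}$ for every sample.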