Leveraging Large Language Models to Address Data Scarcity in Machine Learning: Applications in Graphene Synthesis

Devi Dutta Biswajeet, Sara Kadkhodaei

TL;DR

This work tackles data scarcity in materials ML by mining the graphene CVD literature and applying prompt-driven LLM data imputation and substrate featurization to make a small, heterogeneous dataset usable. By combining LLM-based imputation with embedding-based substrate representations and discretization, a classical SVM classifier improves markedly over its baseline, with binary accuracy rising from ~39% to ~65% and ternary accuracy from ~52% to ~72%, and it also outperforms a fine-tuned LLM predictor. The study indicates that, in data-constrained settings, careful data engineering, rather than fine-tuning of large language models alone, yields better predictive performance and generalization. The proposed framework offers a broadly applicable approach to improving ML on scarce, inhomogeneous datasets in materials science, with potential use beyond graphene CVD in other synthesis and processing domains.

Abstract

Machine learning in materials science faces challenges due to limited experimental data, as generating synthesis data is costly and time-consuming, especially with in-house experiments. Mining data from existing literature introduces issues like mixed data quality, inconsistent formats, and variations in reporting experimental parameters, complicating the creation of consistent features for the learning algorithm. Additionally, combining continuous and discrete features can hinder the learning process with limited data. Here, we propose strategies that utilize large language models (LLMs) to enhance machine learning performance on a limited, heterogeneous dataset of graphene chemical vapor deposition synthesis compiled from existing literature. These strategies include prompting modalities for imputing missing data points and leveraging large language model embeddings to encode the complex nomenclature of substrates reported in chemical vapor deposition experiments. The proposed strategies enhance graphene layer classification using a support vector machine (SVM) model, increasing binary classification accuracy from 39% to 65% and ternary accuracy from 52% to 72%. We compare the performance of the SVM and a GPT-4 model, trained and fine-tuned, respectively, on the same data. Our results demonstrate that the numerical classifier, when combined with LLM-driven data enhancements, outperforms the standalone LLM predictor, highlighting that in data-scarce scenarios, improving predictive learning with LLM strategies requires more than simple fine-tuning on datasets. Instead, it necessitates sophisticated approaches for data imputation and feature space homogenization to achieve optimal performance. The proposed strategies emphasize data enhancement techniques, offering a broadly applicable framework for improving machine learning performance on scarce, inhomogeneous datasets.
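The abstract notes that mixing continuous and discrete features can hinder learning on small datasets, and the summary lists discretization among the remedies. As a rough illustration only, the sketch below bins a continuous attribute into equal-width intervals. The equal-width scheme, the function name `discretize_equal_width`, and the example temperatures are assumptions made for illustration; this section does not state which binning scheme the authors use.

```python
from typing import List, Optional


def discretize_equal_width(
    values: List[Optional[float]], n_bins: int = 3
) -> List[Optional[int]]:
    """Map continuous values to bin indices 0..n_bins-1; None stays None."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    observed = [v for v in values if v is not None]
    if not observed:
        return [None] * len(values)
    lo, hi = min(observed), max(observed)
    width = (hi - lo) / n_bins
    bins: List[Optional[int]] = []
    for v in values:
        if v is None:
            # Missing entries are left for a separate imputation step.
            bins.append(None)
        elif width == 0:
            # All observed values are identical, so there is a single bin.
            bins.append(0)
        else:
            # Clamp so the maximum value falls in the last bin.
            bins.append(min(int((v - lo) / width), n_bins - 1))
    return bins


# Hypothetical growth temperatures (degrees Celsius) with one unreported value.
temperatures = [900.0, 1000.0, None, 1050.0, 1080.0]
temperature_bins = discretize_equal_width(temperatures, n_bins=3)
```

After a step like this, a continuous attribute shares a categorical scale with attributes that were already discrete, which is one simple way to homogenize a mixed feature space before it is passed to a classifier.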

Paper Structure

This paper contains 27 sections, 11 equations, 23 figures, 9 tables.

Figures (23)

  • Figure 1: Overview of the methods employed in this study to enhance predictive classification performance on a limited, heterogeneous dataset for graphene chemical vapor deposition growth.
  • Figure 2: The workflow of the prompt-based LLM imputation methodology developed in this study.
  • Figure 3: Mean absolute error bars of various imputation techniques used in this study.
  • Figure 4: Comparison of attribute distributions between the existing dataset and the imputed dataset across different imputation methods.
  • Figure 5: Comparison of imputation methods based on Jensen-Shannon Divergence (JSD) and Earth Mover's Distance (EMD) across attributes. The top row presents heat maps of different attributes across various imputation methods, while the bottom row depicts the average performance across attributes along with variability.
  • ...and 18 more figures
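
Figure 5 compares imputation methods by Jensen-Shannon Divergence (JSD) and Earth Mover's Distance (EMD). As a minimal sketch of what these two metrics measure, the code below computes both for a pair of histograms defined over the same equally spaced bins. The base-2 logarithm, the shared-bin assumption, and the function names are illustrative choices, not details taken from the paper.

```python
import math
from typing import List, Sequence


def _normalize(hist: Sequence[float]) -> List[float]:
    """Scale non-negative counts so they sum to 1."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive total mass")
    return [h / total for h in hist]


def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JSD with base-2 logs, so the result lies in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    p_n, q_n = _normalize(p), _normalize(q)
    m = [(a + b) / 2 for a, b in zip(p_n, q_n)]

    def kl(a: List[float], b: List[float]) -> float:
        # Terms with zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p_n, m) + 0.5 * kl(q_n, m)


def earth_movers_distance(
    p: Sequence[float], q: Sequence[float], bin_width: float = 1.0
) -> float:
    """1D EMD on shared bins: summed absolute CDF difference times bin width."""
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    p_n, q_n = _normalize(p), _normalize(q)
    cdf_gap = 0.0
    total = 0.0
    for a, b in zip(p_n, q_n):
        cdf_gap += a - b
        total += abs(cdf_gap)
    return total * bin_width


# Hypothetical histograms of one attribute: observed vs. imputed values.
observed_hist = [4, 3, 2, 1]
imputed_hist = [1, 2, 3, 4]
jsd_value = jensen_shannon_divergence(observed_hist, imputed_hist)
emd_value = earth_movers_distance(observed_hist, imputed_hist)
```

Both metrics are zero when the imputed distribution matches the observed one. JSD responds only to how much probability mass differs bin by bin, while EMD also accounts for how far along the attribute axis that mass would have to move.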