MatWheel: Addressing Data Scarcity in Materials Science Through Synthetic Data

Wentao Li, Yizhe Chen, Jiangjie Qiu, Xiaonan Wang

TL;DR

MatWheel tackles data scarcity in materials science by leveraging synthetic data from a conditional generative model to augment property-prediction training. The method combines CGCNN with Con-CDVAE and explores fully supervised and semi-supervised workflows, including KDE-based sampling for conditioning. Key findings show synthetic data can reach or surpass real-data performance in extreme scarcity, with pseudo-labels having limited impact on data quality. The work highlights limitations of current synthetic-data approaches and outlines directions to realize a robust data flywheel through advanced generative models like MatterGen and optimized generation strategies.
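
The KDE-based sampling mentioned above can be illustrated with a small sketch. It fits a Gaussian kernel density estimate to the property values seen in training and draws new target values from it, which could then serve as conditions for the generative model. This is only an illustration under assumptions: the function name `kde_sample`, the example property values, and the rule-of-thumb bandwidth (1.06 σ n^(-1/5)) are hypothetical choices, not details taken from the paper.

```python
import random
import statistics


def kde_sample(values, n_samples, seed=0):
    """Draw n_samples from a 1-D Gaussian KDE fitted to `values`.

    Sampling from a Gaussian KDE is equivalent to picking one of the
    observed points uniformly at random and adding Gaussian noise whose
    standard deviation is the kernel bandwidth.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to estimate a bandwidth")
    rng = random.Random(seed)
    sigma = statistics.stdev(values)
    # Assumption: normal-reference rule-of-thumb bandwidth, 1.06 * sigma * n^(-1/5).
    bandwidth = 1.06 * sigma * len(values) ** (-0.2)
    return [rng.choice(values) + rng.gauss(0.0, bandwidth) for _ in range(n_samples)]


# Hypothetical property values from a small training set.
train_properties = [0.12, 0.35, 0.41, 0.58, 0.77, 1.03]

# Target values that a conditional generator could be asked to match.
conditions = kde_sample(train_properties, n_samples=1000)
```

Sampling conditions from a density estimate, rather than reusing the training labels directly, keeps the generated targets close to the observed property distribution while still covering values between the observed points.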

Abstract

Data scarcity and the high cost of annotation are persistent challenges in materials science. Inspired by the potential of synthetic data in other fields such as computer vision, we propose the MatWheel framework, which trains a material property prediction model using synthetic data generated by a conditional generative model. We explore two scenarios: fully supervised and semi-supervised learning. Using CGCNN for property prediction and Con-CDVAE as the conditional generative model, we conduct experiments on two data-scarce material property datasets from the Matminer database. Results show that synthetic data has potential in extremely data-scarce scenarios, achieving performance close to or exceeding that of real samples in both tasks. We also find that pseudo-labels have little impact on the quality of the generated data. Future work will integrate more advanced models and optimize generation conditions to boost the effectiveness of the materials data flywheel.

Paper Structure

This paper contains 4 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Overall MatWheel framework. The semi-supervised case is divided into three stages of training and inference; the full-sample (fully supervised) case is divided into two.
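
The caption does not spell out what each stage contains, so the following is only a sketch of one plausible reading, assuming the semi-supervised case runs (1) predictor training and pseudo-labeling, (2) generator training and synthetic-data generation, and (3) predictor retraining, while the full-sample case skips the pseudo-labeling stage. All function names, the `(structure, property)` tuple representation, and the toy stand-ins for CGCNN and Con-CDVAE are hypothetical.

```python
def semi_supervised_flywheel(labeled, unlabeled, train_predictor, train_generator, conditions):
    """Three stages (assumed breakdown, not confirmed by the caption)."""
    # Stage 1: train a predictor on labeled data, then pseudo-label the rest.
    predictor = train_predictor(labeled)
    pseudo_labeled = [(structure, predictor(structure)) for structure in unlabeled]
    # Stage 2: train the conditional generator and generate synthetic samples.
    generator = train_generator(labeled + pseudo_labeled)
    synthetic = [(generator(target), target) for target in conditions]
    # Stage 3: retrain the predictor on real plus synthetic data.
    return train_predictor(labeled + synthetic)


def fully_supervised_flywheel(train, train_predictor, train_generator, conditions):
    """Two stages (assumed breakdown): no pseudo-labeling is needed."""
    # Stage 1: train the conditional generator and generate synthetic samples.
    generator = train_generator(train)
    synthetic = [(generator(target), target) for target in conditions]
    # Stage 2: train the predictor on real plus synthetic data.
    return train_predictor(train + synthetic)


# Toy stand-ins so the sketch runs; real models would replace these.
def toy_train_predictor(samples):
    mean = sum(value for _, value in samples) / len(samples)
    return lambda structure: mean  # always predicts the training mean


def toy_train_generator(samples):
    return lambda target: f"generated(target={target:.2f})"


labeled = [("A", 1.0), ("B", 3.0)]
model = semi_supervised_flywheel(
    labeled, ["C"], toy_train_predictor, toy_train_generator, conditions=[2.0, 4.0]
)
```

Passing the training routines in as arguments keeps the stage ordering separate from the model choice, so a different predictor or generator could be swapped in without changing the workflow.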