LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative Models

Siddharth Betala, Samuel P. Gleason, Ali Ramlaoui, Andy Xu, Georgia Channing, Daniel Levy, Clémentine Fourrier, Nikita Kazeev, Chaitanya K. Joshi, Sékou-Oumar Kaba, Félix Therrien, Alex Hernandez-Garcia, Rocío Mercado, N. M. Anoop Krishnan, Alexandre Duval

TL;DR

LeMat-GenBench tackles the lack of standardized evaluation for crystal-generative models by introducing a unified benchmark and open-source toolbox. It defines a comprehensive unconditional-generation metric suite (SUN/MSUN) anchored by a self-consistent MLIP-based convex hull and LeMat-Bulk as a broad reference. The paper benchmarks 12 state-of-the-art generative methods, revealing clear trade-offs between stability, novelty, and diversity, and showing no single approach dominates. It also establishes a public leaderboard and discusses design choices to improve reliability and future extensions toward conditional generation and synthesis-aware discovery.

Abstract

Generative machine learning (ML) models hold great promise for accelerating materials discovery through the inverse design of inorganic crystals, enabling an unprecedented exploration of chemical space. Yet, the lack of standardized evaluation frameworks makes it challenging to evaluate, compare, and further develop these ML models meaningfully. In this work, we introduce LeMat-GenBench, a unified benchmark for generative models of crystalline materials, supported by a set of evaluation metrics designed to better inform model development and downstream applications. We release both an open-source evaluation suite and a public leaderboard on Hugging Face, and benchmark 12 recent generative models. Results reveal that an increase in stability leads to a decrease in novelty and diversity on average, with no model excelling across all dimensions. Altogether, LeMat-GenBench establishes a reproducible and extensible foundation for fair model comparison and aims to guide the development of more reliable, discovery-oriented generative models for crystalline materials.

Paper Structure

This paper contains 38 sections, 6 equations, 17 figures, 16 tables.

Figures (17)

  • Figure 1: LeMat-GenBench pipeline from raw model outputs to comprehensive evaluation. The framework begins by filtering generated crystals through rigorous validity checks. Valid structures are enriched with structural fingerprints, crystallographic descriptors, and MLIP-based energetic properties. Then, LeMat-GenBench computes a unified set of metrics capturing stability, novelty, uniqueness, diversity, distributional alignment, practical synthesizability considerations, and model efficiency. The resulting metric suite provides a standardized, leaderboard-ready assessment of generative models for inorganic crystal design.
  • Figure 2: Spider plots comparing generative models using LeMat-Bulk as reference. Left: (M)S.U.N. metrics measuring validity, uniqueness, novelty, (meta)stability, and the combined (M)S.U.N. score. Right: Quality metrics, including energy above hull ($E_\mathrm{hull}$), structural relaxation (RMSD), diversity (the normalized average of elemental, space-group, and size diversity), and Jensen-Shannon (JS) divergence, which measures distributional similarity to the reference. All metrics are normalized to the same scale, with outer positions indicating better performance.
  • Figure 3: Validity comparison between SMACT and our proposed method. Five random samples of 1,000 structures from LeMat-Bulk were evaluated using both methods. Left: Overall validity rates with standard deviation across seeds. Right: Agreement breakdown. Our method recovers 12.6% of structures that SMACT rejects, primarily f-block- and metalloid-containing structures and structures with elements that have multiple oxidation states, while disagreeing on only 1.0% in the opposite direction. Seed-level results are shown in \ref{fig:seed_variation}.
  • Figure 4: F1-score for stability prediction on LeMat-Bulk. Three MLIPs (orb-v3, uma-s1p1, mace-mp) are evaluated at different $E_{\text{hull}}$ thresholds. (a) MLIP energies compared against a DFT-constructed hull. (b) Self-consistent approach where each MLIP defines its own hull. The self-consistent method yields consistently higher F1-scores, particularly at strict thresholds.
  • Figure 5: An overview of the generative AI paradigm for candidate structure generation and optimization that underpins much of the work reviewed herein.
  • ...and 12 more figures
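
The Figure 4 caption describes scoring stability prediction as an F1-score at different $E_{\text{hull}}$ thresholds. The following is a minimal sketch of that threshold-and-score step only, not the benchmark's implementation. It assumes a structure counts as stable when its energy above hull is at or below the threshold, and that reference labels come from DFT energies above hull. The function name `f1_at_threshold`, the thresholds, and all energy values are hypothetical and chosen for illustration.

```python
from typing import Sequence


def f1_at_threshold(
    e_hull_ref: Sequence[float],
    e_hull_pred: Sequence[float],
    threshold: float,
) -> float:
    """F1-score for the binary label 'stable', assumed here to mean E_hull <= threshold.

    e_hull_ref:  reference energies above hull (eV/atom), e.g. from DFT.
    e_hull_pred: predicted energies above hull (eV/atom), e.g. from an MLIP.
    """
    tp = fp = fn = 0
    for ref, pred in zip(e_hull_ref, e_hull_pred):
        ref_stable = ref <= threshold
        pred_stable = pred <= threshold
        if ref_stable and pred_stable:
            tp += 1  # correctly predicted stable
        elif pred_stable:
            fp += 1  # predicted stable, reference says unstable
        elif ref_stable:
            fn += 1  # predicted unstable, reference says stable
    denom = 2 * tp + fp + fn
    # Convention: return 0.0 when there are no stable labels or predictions.
    return 2 * tp / denom if denom else 0.0


# Made-up energies above hull in eV/atom (illustrative only).
e_hull_dft = [0.00, 0.02, 0.08, 0.15, 0.00, 0.05]
e_hull_mlip = [0.01, 0.00, 0.12, 0.09, 0.03, 0.04]

for threshold in (0.0, 0.05, 0.1):
    score = f1_at_threshold(e_hull_dft, e_hull_mlip, threshold)
    print(f"threshold={threshold:.2f} eV/atom  F1={score:.2f}")
```

In this toy data the strictest threshold gives the lowest score, because small energy errors flip labels that sit exactly on the hull. This illustrates why strict thresholds are the hardest case, but it says nothing about the actual results in Figure 4.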