From Word2Vec to Transformers: Text-Derived Composition Embeddings for Filtering Combinatorial Electrocatalysts

Lei Zhang, Markus Stricker

TL;DR

A label-free screening strategy is evaluated that represents each composition by an embedding derived from scientific texts and prioritizes candidates by their similarity to two property concepts, conductivity and dielectric.

Abstract

Compositionally complex solid solution electrocatalysts span vast composition spaces, and even a single materials system can contain more candidate compositions than can be measured exhaustively. Here we evaluate a label-free screening strategy that represents each composition using embeddings derived from scientific texts and prioritizes candidates based on similarity to two property concepts. We compare a corpus-trained Word2Vec baseline with transformer-based embeddings, where compositions are encoded either by linear element-wise mixing or by short composition prompts. Similarities to two "concept directions", the terms conductivity and dielectric, define a two-dimensional descriptor space, and a symmetric Pareto-front selection is used to filter candidate subsets without using electrochemical labels. Performance is assessed on 15 materials libraries, including noble metal alloys and multicomponent oxides. In this setting, the lightweight Word2Vec baseline, which uses a simple linear combination of element embeddings, often achieves the largest reduction in the number of candidate compositions while staying close to the best measured performance.
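The pipeline described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the helper names (`mix_embedding`, `cosine`, `pareto_front`), the toy two-dimensional vectors, and in particular the choice to keep points that are non-dominated when both similarities are maximized. The abstract does not spell out the exact rule behind the symmetric Pareto-front selection, so this is one plausible reading rather than the authors' implementation.

```python
import numpy as np

def mix_embedding(composition, element_vectors):
    """Linear element-wise mixing: fraction-weighted sum of element vectors.

    `composition` maps element symbols to (unnormalized) fractions.
    """
    total = sum(composition.values())
    return sum((frac / total) * element_vectors[el]
               for el, frac in composition.items())

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pareto_front(points):
    """Indices of points not dominated when maximizing both coordinates.

    Assumption: 'better' means larger in both descriptor dimensions.
    """
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # A point q dominates p if q >= p everywhere and q > p somewhere.
        dominated = np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))
        if not dominated:
            keep.append(i)
    return keep

# Toy element and concept vectors (hypothetical, not from any trained model).
element_vectors = {"Pt": np.array([1.0, 0.0]), "Ru": np.array([0.0, 1.0])}
concept_vectors = {"conductivity": np.array([1.0, 0.2]),
                   "dielectric": np.array([0.2, 1.0])}

compositions = [{"Pt": 0.8, "Ru": 0.2}, {"Pt": 0.5, "Ru": 0.5}, {"Pt": 0.2, "Ru": 0.8}]

# Two-dimensional descriptor: similarity to each concept direction.
descriptors = [
    [cosine(mix_embedding(c, element_vectors), concept_vectors[name])
     for name in ("conductivity", "dielectric")]
    for c in compositions
]
selected = pareto_front(descriptors)
```

No electrochemical labels enter at any point: the selection depends only on the text-derived vectors, which is what makes the filter label-free.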

Paper Structure (18 sections, 4 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Heatmap of the relative error in the current density of the best-performing composition in the Pareto-selected subset, compared with the best-performing composition among all compositions of the respective materials library. Colour indicates the percentage deviation, with lighter shades for smaller errors and darker shades for larger errors.
  • Figure 2: Fraction of compositions retained in the Pareto-selected subset for each embedding method. Bars show the mean retained fraction across all material systems. The overlaid dots show the retained fraction for each individual material system; they are slightly shifted left/right only to prevent points from overlapping (the shift has no numerical meaning).
  • Figure 3: Trade-off between the fraction of candidates retained and the relative error in the current density of the best-performing composition for all material systems and embedding methods.
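
The two quantities plotted in these figures can be computed as in the following sketch. The function names, the `higher_is_better` flag, and the normalization by the library-wide best value are assumptions made for illustration. The captions do not state the sign convention of the current density or the exact error formula, so treat this as one reasonable definition, not the paper's.

```python
def retained_fraction(n_selected, n_total):
    """Fraction of a library's compositions kept by the Pareto selection."""
    return n_selected / n_total

def best_value_relative_error(all_values, selected_indices, higher_is_better=True):
    """Percentage deviation between the best value in the selected subset
    and the best value in the full library.

    Assumption: 'best' is the maximum unless higher_is_better is False.
    """
    pick = max if higher_is_better else min
    best_all = pick(all_values)
    best_selected = pick(all_values[i] for i in selected_indices)
    return abs(best_all - best_selected) / abs(best_all) * 100.0

# Toy current densities (hypothetical values, arbitrary units).
current_density = [1.0, 4.0, 3.0, 2.0]
selected = [0, 2]

fraction = retained_fraction(len(selected), len(current_density))  # 0.5
error = best_value_relative_error(current_density, selected)       # 25.0
```

A useful filter keeps both numbers small: a low retained fraction means few measurements are needed, and a low relative error means the subset still contains a composition close to the best one in the library.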