CataLM: Empowering Catalyst Design Through Large Language Models

Ludi Wang; Xueqing Chen; Yi Du; Yuanchun Zhou; Yang Gao; Wenjuan Cui

TL;DR

CataLM addresses the need for catalyst-domain AI tools by building a domain-specific LLM for electrocatalytic materials. It combines domain pre-training on a large corpus of open-access electrocatalysis literature with instruction tuning on expert-annotated data, and uses retrieval augmentation to support precise knowledge extraction and design tasks. The model demonstrates competitive performance on named entity recognition and catalyst control method recommendation, and expert validation suggests that its outputs are better informed by domain knowledge than those of generic LLMs. An open-source release and planned downstream platforms aim to accelerate human-AI collaboration in catalyst discovery and development.

Abstract

The field of catalysis holds paramount importance in shaping the trajectory of sustainable development, prompting intensive research efforts to leverage artificial intelligence (AI) in catalyst design. Presently, the fine-tuning of open-source large language models (LLMs) has yielded significant breakthroughs across various domains such as biology and healthcare. Drawing inspiration from these advancements, we introduce CataLM (Catalytic Language Model), a large language model tailored to the domain of electrocatalytic materials. Our findings demonstrate that CataLM exhibits remarkable potential for facilitating human-AI collaboration in catalyst knowledge exploration and design. To the best of our knowledge, CataLM stands as the pioneering LLM dedicated to the catalyst domain, offering novel avenues for catalyst discovery and development.

Paper Structure (11 sections, 4 figures, 4 tables)

Introduction
Related Work
CataLM
Domain Pre-training
Instruction Tuning
Training process
Evaluation
Named Entity Recognition Task
Control Method Recommendation Task
Conclusion
Competing Interests

Figures (4)

Figure 1: The training pipeline of CataLM. The bottom part illustrates the primary training pipeline of CataLM, while the top part of the figure delineates the entire data preparation process for training.
Figure 2: Instruction format for the catalytic material recommendation scenario.
Figure 3: Prompt in the named entity recognition task.
Figure 4: Answers from CataLM and the original LLM.
