Exploring the Benefits of Domain-Pretraining of Generative Large Language Models for Chemistry

Anurag Acharya, Shivam Sharma, Robin Cosbey, Megha Subramanian, Scott Howland, Maria Glenski

TL;DR

The results show that not only do in-domain base models perform reasonably well on in-domain tasks in a zero-shot setting but that further adaptation using instruction fine-tuning yields impressive performance on chemistry-specific tasks such as named entity recognition and molecular formula generation.

Abstract

A proliferation of Large Language Models (the GPT series, BLOOM, LLaMA, and more) is driving forward novel development of multipurpose AI for a variety of tasks, particularly natural language processing (NLP) tasks. These models demonstrate strong performance on a range of tasks; however, there has been evidence of brittleness when they are applied to more niche or narrow domains, where hallucinations or fluent but incorrect responses reduce performance. Given the complex nature of scientific domains, it is prudent to investigate the trade-offs of leveraging off-the-shelf versus more targeted foundation models for scientific domains. In this work, we examine the benefits of in-domain pre-training for a given scientific domain, chemistry, and compare the resulting models to open-source, off-the-shelf models with zero-shot and few-shot prompting. Our results show that not only do in-domain base models perform reasonably well on in-domain tasks in a zero-shot setting but that further adaptation using instruction fine-tuning yields impressive performance on chemistry-specific tasks such as named entity recognition and molecular formula generation.

Paper Structure

This paper contains 15 sections, 1 equation, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Model performance, measured by accuracy, in zero-shot (above) and few-shot (below) evaluations across varying subsets of the MMLU benchmark. A ‡ denotes that both AISLE models outperform the baselines in both zero-shot and few-shot evaluations, and a † indicates that at least one AISLE model outperforms both baselines.
  • Figure 2: Relative improvement in accuracy, ranging from 10% to 50%, achieved by the AISLE$_{GPT2}$ model over baselines for zero-shot (0) and few-shot using 3 examples (3).
  • Figure 3: Bar plots show the number of examples with each edit distance per model. Bump charts rank the models by the number of examples with each edit distance; ideally, a model has a rank of 1 at lower edit distances and a rank of 4 at higher edit distances.
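
The quantities named in the captions for Figures 2 and 3 can be illustrated with a minimal sketch. The captions do not say which edit distance is used or how relative improvement is computed, so the code below assumes character-level Levenshtein distance and the ratio (model - baseline) / baseline. The function names and the example formulas are hypothetical and are not taken from the paper.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed metric), using a rolling DP row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def distance_histogram(predictions, references):
    """Count how many examples fall at each edit distance (bar-plot input)."""
    return Counter(edit_distance(p, r) for p, r in zip(predictions, references))


def relative_improvement(model_acc: float, baseline_acc: float) -> float:
    """Assumed definition: fractional accuracy gain over the baseline."""
    return (model_acc - baseline_acc) / baseline_acc


# Hypothetical predicted vs. reference molecular formulas.
predicted = ["H2O", "CO2", "NaCl"]
reference = ["H2O", "CO", "NaCl"]
histogram = distance_histogram(predicted, reference)
```

Under these assumptions, a model with more examples at distance 0 and fewer at large distances would rank higher in a bump chart like the one described for Figure 3.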