L+M-24: Building a Dataset for Language + Molecules @ ACL 2024

Carl Edwards; Qingyun Wang; Lawrence Zhao; Heng Ji

TL;DR

To address the scarcity of molecule-language datasets, this work introduces L+M-24, a large, template-driven dataset assembled from PubChem, CheF, and ChemFOnt for the ACL 2024 Language + Molecules workshop. It formalizes two translation tasks, molecule captioning (molecule to description) and description-guided molecule generation (description to molecule), while emphasizing compositionality, abstraction, and functionality across four application domains. GPT-4 is used to generate 917 templates, which yield 160,492 training pairs and 21,839 evaluation pairs, with held-out property combinations to probe generalization. Baseline experiments with MolT5 variants and Meditron-7B establish initial benchmarks across a range of metrics. By foregrounding compositionality and domain-specific semantics, L+M-24 aims to advance language-guided molecular design for applications in drug discovery, materials science, and green chemistry, and invites broader participation from the community.

Abstract

Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. To date, the datasets that have been released are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, or 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the $\textit{L+M-24}$ dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, $\textit{L+M-24}$ is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.

Paper Structure (17 sections, 4 figures, 7 tables)

Introduction
Task Formulation
Designing for Compositionality, Abstraction, and Function
Data Sources
PubChem
Chemical Function (CheF)
ChemFOnt: the chemical functional ontology resource
Dataset Details
Template Generation
Converting Templates to Descriptions
Splitting
Evaluation Metrics
Benchmarks
Future Directions
Conclusion
...and 2 more sections

Figures (4)

Figure 1: Example descriptions created for molecules from the training set.
Figure 2: Examples of molecules generated by different models for never-before-seen property combinations.
Figure 3: Examples of molecules generated by different models for never-before-seen property combinations.
Figure 4: Breakdown of different property classes in $\textit{L+M-24}$.
