L+M-24: Building a Dataset for Language + Molecules @ ACL 2024
Carl Edwards, Qingyun Wang, Lawrence Zhao, Heng Ji
TL;DR
To address the scarcity of molecule-language datasets, this work introduces L+M-24, a large, template-driven dataset assembled from PubChem, CheF, and ChemFOnt for the Language + Molecules Workshop at ACL 2024. It formalizes two translation tasks, molecule captioning (molecule to description) and description-guided molecule generation (description to molecule), while emphasizing compositionality, abstraction, and functionality across four application domains. GPT-4 is used to generate 917 templates, which yield 160,492 training pairs and 21,839 evaluation pairs; some property combinations are held out to probe generalization. Baseline experiments with MolT5 variants and Meditron-7B establish initial benchmarks under a range of metrics. By foregrounding compositionality and domain-specific semantics, L+M-24 aims to advance language-guided molecular design for drug discovery, materials science, and green chemistry, and the authors invite broader participation from the community.
Abstract
Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. To date, released datasets fall into three categories: 1) small and scraped from existing databases, 2) large but noisy, constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the $\textit{L+M-24}$ dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, $\textit{L+M-24}$ is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.
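The template-based construction and the held-out property combinations mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two templates, the property labels (`property_a` and so on), the SMILES-to-property assignments, and the helper names `make_caption` and `split_by_held_out` are hypothetical and are not taken from the L+M-24 release, and the actual splitting rule used by the dataset may differ.

```python
from itertools import combinations

# Hypothetical templates; the real dataset uses 917 GPT-4-generated templates.
TEMPLATES = [
    "The molecule is {props}.",
    "This compound has been described as {props}.",
]

# Toy records: SMILES strings paired with invented property labels.
# The labels are placeholders, not chemical facts.
RECORDS = [
    ("CCO", ["property_a", "property_b"]),
    ("CC(=O)O", ["property_a", "property_c"]),
    ("c1ccccc1", ["property_b", "property_c"]),
    ("CCN", ["property_a"]),
    ("CCC", ["property_b"]),
]


def join_props(props):
    """Join property labels into a readable phrase, e.g. 'x, y and z'."""
    props = sorted(props)
    if len(props) == 1:
        return props[0]
    return ", ".join(props[:-1]) + " and " + props[-1]


def make_caption(props, template_index):
    """Fill one template (chosen by index, cycling) with the property phrase."""
    template = TEMPLATES[template_index % len(TEMPLATES)]
    return template.format(props=join_props(props))


def property_pairs(props):
    """Return every unordered pair of properties attached to one molecule."""
    return {frozenset(pair) for pair in combinations(sorted(props), 2)}


def split_by_held_out(records, held_out):
    """Send a molecule to evaluation if it carries any held-out property pair."""
    train, evaluation = [], []
    for i, (smiles, props) in enumerate(records):
        pair = (smiles, make_caption(props, i))
        if property_pairs(props) & held_out:
            evaluation.append(pair)
        else:
            train.append(pair)
    return train, evaluation


# Assumed held-out combination: each property appears alone or in other
# pairs during training, but never together.
HELD_OUT = {frozenset({"property_b", "property_c"})}
train_pairs, eval_pairs = split_by_held_out(RECORDS, HELD_OUT)
```

In this toy split, `property_b` and `property_c` each occur in training captions, but only the evaluation set contains a molecule with both, which is the kind of compositional generalization the held-out combinations are meant to probe.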
