Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language
Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin
TL;DR
This work tackles data scarcity in molecule-language modeling by introducing LA$^3$, an automated annotation-augmentation pipeline that uses large language models to rewrite molecular captions, producing the LaChEBI-20 dataset. Trained on this augmented dataset, LaMolT5 shows strong gains in text-based de novo molecule generation and molecule captioning, with up to 301% improvement in generation and notable improvements in captioning, while remaining parameter-efficient. The approach also extends to image, text, and graph tasks, suggesting utility beyond the molecular domain. Overall, LA$^3$ provides a scalable, automated route to richer, more diverse annotations that improve cross-modal molecular understanding and generation, with practical relevance for drug discovery and related AI tasks.
Abstract
Recent advancements in AI for biological research focus on integrating molecular data with natural language to accelerate drug discovery. However, the scarcity of high-quality annotations limits progress in this area. This paper introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework that leverages large language models to augment existing datasets, thereby improving AI training. We demonstrate the effectiveness of LA$^3$ by creating an enhanced dataset, LaChEBI-20, in which we systematically rewrite the annotations of molecules from an established dataset. These rewritten annotations preserve essential molecular information while providing more varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5, based on a benchmark architecture, to learn the mapping between molecular representations and augmented annotations. Experimental results on text-based *de novo* molecule generation and molecule captioning demonstrate that LaMolT5 outperforms state-of-the-art models. Notably, incorporating LA$^3$ leads to improvements of up to 301% over the benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$ through notable applications in *image*, *text*, and *graph* tasks, affirming its versatility and utility.
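The annotation-augmentation idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. The `Pair` record, the `augment` function, the use of SMILES strings as the molecular representation, the choice to keep the original caption alongside its rewrites, and the de-duplication step are all assumptions made for illustration. The LLM is abstracted as a plain `rewrite` callable, and `toy_rewrite` is a hypothetical stand-in for it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Pair:
    """One molecule-annotation pair (hypothetical record type)."""
    smiles: str   # molecular representation; SMILES is an assumption here
    caption: str  # natural-language annotation


def augment(
    pairs: Iterable[Pair],
    rewrite: Callable[[str], str],
    n_rewrites: int = 1,
) -> List[Pair]:
    """Return the original pairs plus rewritten-caption variants.

    `rewrite` stands in for an LLM call that paraphrases a caption while
    preserving its molecular content. Keeping the original caption and
    dropping duplicate rewrites are assumptions of this sketch.
    """
    augmented: List[Pair] = []
    for pair in pairs:
        augmented.append(pair)          # keep the original annotation
        seen = {pair.caption}
        for _ in range(n_rewrites):
            new_caption = rewrite(pair.caption).strip()
            # Skip empty outputs and rewrites we already have.
            if new_caption and new_caption not in seen:
                seen.add(new_caption)
                augmented.append(Pair(pair.smiles, new_caption))
    return augmented


def toy_rewrite(caption: str) -> str:
    """Deterministic placeholder for an LLM rewriter (illustration only)."""
    return "In short: " + caption


if __name__ == "__main__":
    data = [Pair("CCO", "Ethanol is a primary alcohol.")]
    for p in augment(data, toy_rewrite, n_rewrites=2):
        print(p.smiles, "|", p.caption)
```

Passing the rewriter in as a callable keeps the sketch independent of any particular LLM provider. In practice the rewriter would also need a check that the rewritten caption still describes the same molecule, which this sketch does not attempt.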
