Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language

Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin

TL;DR

This work tackles data scarcity in molecule-language modeling by introducing LA$^3$, an automated annotation-augmentation pipeline that uses large language models to rewrite molecular captions, producing the LaChEBI-20 dataset. Trained on this augmented dataset, the LaMolT5 model shows strong gains in text-based de novo molecule generation and molecule captioning, achieving up to 301% improvement in generation and notable improvements in captioning while remaining parameter-efficient. The approach also extends to image, text, and graph tasks, suggesting utility beyond the molecular domain. Overall, LA$^3$ provides a scalable, automated route to richer, more diverse annotations that improve cross-modal molecular understanding and generation, with practical impact for drug discovery and related AI tasks.

Abstract

Recent advancements in AI for biological research focus on integrating molecular data with natural language to accelerate drug discovery. However, the scarcity of high-quality annotations limits progress in this area. This paper introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework that leverages large language models to augment existing datasets, thereby improving AI training. We demonstrate the effectiveness of LA$^3$ by creating an enhanced dataset, LaChEBI-20, where we systematically rewrite the annotations of molecules from an established dataset. These rewritten annotations preserve essential molecular information while providing more varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 based on a benchmark architecture to learn the mapping between molecular representations and augmented annotations. Experimental results on text-based *de novo* molecule generation and molecule captioning demonstrate that LaMolT5 outperforms state-of-the-art models. Notably, incorporating LA$^3$ leads to improvements of up to 301% over the benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$ on applications in *image*, *text* and *graph* tasks, affirming its versatility and utility.

Paper Structure

This paper contains 24 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Molecule generation performance of LaMolT5-Small with different LA$^3$ augmentations. Conventional augmentation (EDA [WZ19], Mixup [ZCDL18]) and straightforward LLMs for data generation [ZZM24] fall behind.
  • Figure 2: An example implementation of LA$^3$ for annotation augmentation (A) and training (B). Given molecules and their original annotations, we prompt LLMs to generate augmented annotations (LaChEBI-20) by rewriting the original annotations. Next, we train LaMolT5 on LaChEBI-20 to learn a mapping function between the molecule's SMILES string and corresponding annotations.
  • Figure 3: Performance vs. number of parameters for LaMolT5 and the top-$3$ state-of-the-art methods on the leaderboard. Overall rank: LaMolT5-Base (#1), LaMolT5-Large (#2) and BioT5 (#3).
  • Figure 4: Molecule generation performance of MolT5-Small and LaMolT5-Small with captions generated by open-source and closed-source LLMs.
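
The augmentation step described in the Figure 2 caption (prompt an LLM to rewrite each original annotation, then pair the rewrites with the same SMILES string) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `Example` type, the `augment` and `build_prompt` helpers, and the `toy_rewriter` stand-in for an LLM call are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    """One molecule-annotation pair (hypothetical container type)."""
    smiles: str
    caption: str


def build_prompt(caption: str) -> str:
    # Assumed prompt wording; the paper's actual prompt is not given here.
    return (
        "Rewrite the following molecule description. Keep every chemical "
        "fact, but vary the sentence structure and vocabulary.\n\n"
        f"Description: {caption}"
    )


def augment(
    dataset: List[Example],
    rewrite: Callable[[str], str],
    n_rewrites: int = 1,
) -> List[Example]:
    """Return the original pairs plus up to n_rewrites rewritten captions each.

    `rewrite` is any function mapping a prompt to text, e.g. a wrapper
    around an LLM client. Empty or duplicate rewrites are dropped.
    """
    augmented: List[Example] = []
    for ex in dataset:
        augmented.append(ex)  # keep the original annotation
        seen = {ex.caption}
        for _ in range(n_rewrites):
            new_caption = rewrite(build_prompt(ex.caption)).strip()
            if new_caption and new_caption not in seen:
                seen.add(new_caption)
                augmented.append(Example(ex.smiles, new_caption))
    return augmented


def toy_rewriter(prompt: str) -> str:
    # Placeholder for a real LLM call: echoes the caption with a prefix.
    caption = prompt.rsplit("Description: ", 1)[1]
    return "In short: " + caption


if __name__ == "__main__":
    data = [Example("CCO", "Ethanol is a primary alcohol.")]
    for pair in augment(data, toy_rewriter, n_rewrites=2):
        print(pair.smiles, "|", pair.caption)
```

The resulting list of (SMILES, caption) pairs would then serve as training data for a sequence-to-sequence model in either direction. Keeping the original annotation alongside the rewrites is a design choice of this sketch, as is dropping duplicate rewrites.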