Data-Efficient Molecular Generation with Hierarchical Textual Inversion

Seojin Kim; Jaehyun Nam; Sihyun Yu; Younghoon Shin; Jinwoo Shin

TL;DR

HI-Mol addresses data-efficient molecular generation with hierarchical textual inversion, which learns multi-level token embeddings from a small set of molecules. The approach keeps a large pre-trained text-to-molecule model frozen and generates novel molecules by interpolating the learned embeddings. Its key contributions are the multi-level token scheme, unsupervised cluster assignment, and an interpolation-based sampling strategy that exploits the hierarchical information. Experiments on QM9 and MoleculeNet show marked data efficiency: on QM9, HI-Mol outperforms the prior state-of-the-art method with 50x less training data, with competitive or superior scores on metrics such as FCD and NSPDK. Molecules generated by HI-Mol also improve low-shot molecular property prediction. The results highlight the potential of combining hierarchical priors with large pre-trained language models in chemistry.

Abstract

Developing a molecular generation framework that remains effective with only a limited number of molecules is often important for practical deployment, e.g., in drug discovery, since acquiring task-related molecular data requires expensive and time-consuming experiments. To tackle this issue, we introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method. HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution. We propose to use multi-level embeddings to reflect such hierarchical features, building on the recent textual inversion technique from the visual domain, which achieves data-efficient image generation. Compared to the conventional textual inversion method in the image domain, which uses a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution. We then generate molecules based on the interpolation of the multi-level token embeddings. Extensive experiments demonstrate the superiority of HI-Mol with notable data efficiency. For instance, on QM9, HI-Mol outperforms the prior state-of-the-art method with 50x less training data. We also show the effectiveness of molecules generated by HI-Mol in low-shot molecular property prediction.
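
To make the interpolation step in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes each training molecule has already been mapped to a learned embedding vector, that two distinct molecules are picked at random, and that the mixing weight is drawn uniformly from [0, 1); the function names `interpolate` and `sample_interpolated_embedding` are hypothetical, and the decoding step with the frozen text-to-molecule model is omitted.

```python
import random
from typing import List, Optional

Vector = List[float]


def interpolate(a: Vector, b: Vector, lam: float) -> Vector:
    """Element-wise linear interpolation: lam * a + (1 - lam) * b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def sample_interpolated_embedding(
    embeddings: List[Vector], rng: Optional[random.Random] = None
) -> Vector:
    """Mix the embeddings of two distinct, randomly chosen molecules.

    Assumption: the mixing weight is uniform on [0, 1); the paper may use
    a different prior over the weight.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings to interpolate")
    if rng is None:
        rng = random.Random()
    i, j = rng.sample(range(len(embeddings)), 2)  # two distinct indices
    lam = rng.random()
    return interpolate(embeddings[i], embeddings[j], lam)


# Toy 2-dimensional embeddings standing in for learned token embeddings.
toy_embeddings = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0]]
mixed = sample_interpolated_embedding(toy_embeddings, random.Random(0))
```

In a full pipeline, the mixed embedding would replace the learned token in the text prompt given to the frozen generator, so each draw of the pair and the weight yields a different candidate molecule.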

Paper Structure (25 sections, 2 equations, 5 figures, 23 tables, 1 algorithm)

This paper contains 25 sections, 2 equations, 5 figures, 23 tables, 1 algorithm.

Introduction
Related Work
HI-Mol: Hierarchical Textual Inversion for Molecular Generation
Problem Description and Overview
Preliminary: Textual Inversion
Detailed Description of HI-Mol
Experiments
Experimental Setup
Main Results
Analysis
Conclusion
Method Details
Datasets
Evaluation Metrics
Baselines
...and 10 more sections

Figures (5)

Figure 1: Overview of the HI-Mol framework. (1) Hierarchical textual inversion: we encode low-shot molecules into multi-level token embeddings. (2) Embedding interpolation-based sampling: we generate novel molecules using interpolation of low-level token embeddings.
Figure 2: Visualizations of molecules in two different clusters obtained from the unsupervised clustering objective with the intermediate tokens in Eq. (\ref{eq:training}) on the HIV dataset (Wu et al., 2018).
Figure 3: Visualization of the generated molecules with the specific condition $\gamma$. The maximum PLogP among the training molecules is 4.52.
Figure 4: Visualizations of the generated molecules with $\gamma=50$. The maximum PLogP among the training molecules is 4.52.
Figure 10: Results of the molecular property maximization task. We report the top-3 property scores denoted by 1st, 2nd, and 3rd. The baseline scores are drawn from Ahn et al. (2022).
