Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation

Yikun Zhang, Geyan Ye, Chaohao Yuan, Bo Han, Long-Kai Huang, Jianhua Yao, Wei Liu, Yu Rong

TL;DR

Atomas addresses the limitations of global molecule-text alignment by introducing Hierarchical Adaptive Alignment, which operates at the atom, fragment, and molecule levels using an Adaptive Polymerization Module and a Weighted Alignment Module on top of a unified SMILES-text encoder. It jointly optimizes global alignment, hierarchical alignment, and generation objectives ($\mathcal{L}_{ga}$, $\mathcal{L}_{haa}$, and $\mathcal{L}_{lm}$), supporting both understanding and generation in a single end-to-end framework. The approach achieves state-of-the-art performance across 12 tasks on 11 datasets, scales well, and is supported by qualitative validation; it also improves fine-grained control over molecular generation without requiring explicit local annotations. By combining unified encoding with hierarchical cross-modal learning, this work advances practical molecular understanding and design, enabling accurate retrieval, captioning, and text-driven molecule generation in data-scarce settings.
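As a rough illustration of what "jointly optimizes" can mean here, the three objectives may be combined into a single training loss. The weights $\lambda_{ga}$, $\lambda_{haa}$, and $\lambda_{lm}$ below are hypothetical symbols introduced only for this sketch; the summary above does not say how the terms are weighted, and setting all three to 1 gives a plain sum.

```latex
% Sketch of a joint objective; the \lambda weights are assumptions, not taken from the source.
\mathcal{L}_{\text{total}}
  = \lambda_{ga}\,\mathcal{L}_{ga}
  + \lambda_{haa}\,\mathcal{L}_{haa}
  + \lambda_{lm}\,\mathcal{L}_{lm}
```

Under this reading, gradients from the alignment terms ($\mathcal{L}_{ga}$, $\mathcal{L}_{haa}$) and from the generation term ($\mathcal{L}_{lm}$) update the shared encoder in the same step, which is what makes the framework end-to-end.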

Abstract

Molecule-and-text cross-modal representation learning has emerged as a promising direction for enhancing the quality of molecular representations, thereby improving performance in various scientific fields. However, most approaches rely on global alignment to learn knowledge from the different modalities, which may fail to capture fine-grained information, such as molecule-and-text fragments and stereoisomeric nuances, that is crucial for downstream tasks. Furthermore, such information cannot be modeled with a similar global alignment strategy, because existing datasets lack annotations for the fine-grained fragments. In this paper, we propose Atomas, a hierarchical molecular representation learning framework that jointly learns representations from SMILES strings and text. We design a Hierarchical Adaptive Alignment model to automatically learn the fine-grained fragment correspondence between the two modalities and align these representations at three semantic levels. Atomas's end-to-end training framework supports both understanding and generating molecules, enabling a wider range of downstream tasks. Atomas achieves superior performance across 12 tasks on 11 datasets, outperforming 11 baseline models, which highlights the effectiveness and versatility of our method. Scaling experiments further demonstrate Atomas's robustness and scalability. Moreover, visualization and qualitative analysis, validated by human experts, confirm the chemical relevance of our approach. Code is released at https://github.com/yikunpku/Atomas.
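To make the "global alignment" baseline concrete, the following is a minimal sketch of a symmetric contrastive loss over one pooled embedding per molecule and one per description. It is a generic CLIP-style formulation, not the authors' implementation: the function names, the `temperature` value, and the use of plain Python lists are all assumptions made for illustration, and details the paper describes elsewhere (such as a momentum model) are omitted.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)

def global_alignment_loss(mol_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired global embeddings.

    mol_embs[i] and text_embs[i] are assumed to describe the same molecule.
    The temperature of 0.07 is a common default, not a value from the source.
    """
    mol = [l2_normalize(v) for v in mol_embs]
    txt = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every molecule and every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(m, t)) / temperature for t in txt] for m in mol]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-molecule direction
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(sim_t))
```

Because each molecule and each description is reduced to a single vector before comparison, a loss of this kind has no term that ties a specific fragment to a specific phrase, which is the gap the hierarchical alignment is meant to fill.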

Paper Structure

This paper contains 48 sections, 13 equations, 12 figures, 22 tables, 1 algorithm.

Figures (12)

  • Figure 1: Atomas is a hierarchical, end-to-end model designed to discover and automatically align local substructures of the input while performing conditional generation. The learned cross-modal representations can be adapted to both understanding tasks (retrieval) and generation tasks.
  • Figure 2: Illustration of the proposed Atomas, which is composed of four components. (1) The Unified Encoder encodes both the input molecule and its corresponding textual description. (2) The Global Alignment module projects and aligns the global features of the molecule and text. A momentum model is used to ensure alignment consistency. (3) Hierarchical Adaptive Alignment aligns the molecule and text at three levels. It includes the Adaptive Polymerization module, which clusters the original token features into distinct representation sets, and the Weighted Alignment module, which aligns the two modalities in a set-wise manner. (4) The Conditional Decoder takes the molecule and text embeddings as input and generates the target modality.
  • Figure 3: Unified encoder vs. separate encoders at different dataset scales, evaluated on the molecule generation task.
  • Figure 4: Ablation study on the effectiveness of joint optimization (left) and the number of hierarchical alignment levels (right).
  • Figure 5: Visualization of the adaptive polymerization module. The polymerization of atoms (words) into individual sets is shown from left to right at three levels, alongside the reference diagram. Atoms (words) belonging to the same set are highlighted in the same color.
  • ...and 7 more figures