UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation

Shuhan Guo, Yatao Bian, Ruibing Wang, Nan Yin, Zhen Wang, Quanming Yao

TL;DR

UniMoT tackles the challenge of unifying molecule and text modalities in large language model frameworks by introducing a tokenizer-based architecture that discretizes molecular information into tokens compatible with text tokens. A Vector Quantization-driven molecule tokenizer, coupled with a causal Q-Former, bridges the modality gap and enables a unified, autoregressive training paradigm for both molecule-to-text and text-to-molecule tasks. Through a four-stage training process, UniMoT expands the vocabulary with molecule tokens and leverages stage-wise pretraining and instruction tuning to achieve state-of-the-art results on a broad suite of molecule comprehension and generation tasks. This work demonstrates the viability and benefits of discrete latent representations for integrated molecular understanding and generation within LLMs, with potential impact on drug discovery and materials science workflows.

Abstract

The remarkable success of Large Language Models (LLMs) across diverse tasks has driven the research community to extend their capabilities to molecular applications. However, most molecular LLMs employ adapter-based architectures that do not treat molecule and text modalities equally and lack a supervision signal for the molecule modality. To address these issues, we introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture that expands the vocabulary of LLM with molecule tokens. Specifically, we introduce a Vector Quantization-driven tokenizer that incorporates a Q-Former to bridge the modality gap between molecule and text. This tokenizer transforms molecules into sequences of molecule tokens with causal dependency, encapsulating high-level molecular and textual information. Equipped with this tokenizer, UniMoT can unify molecule and text modalities under a shared token representation and an autoregressive training paradigm, enabling it to interpret molecules as a foreign language and generate them as text. Following a four-stage training scheme, UniMoT emerges as a multi-modal generalist capable of performing both molecule-to-text and text-to-molecule tasks. Extensive experiments demonstrate that UniMoT achieves state-of-the-art performance across a wide range of molecule comprehension and generation tasks.
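
To make the tokenizer idea concrete, the following is a minimal sketch of the vector-quantization step, not the authors' implementation. It assumes the tokenizer emits one continuous vector per query and that each vector is replaced by the index of its nearest codebook entry; those indices are then shifted past the text vocabulary so molecule tokens and text tokens share one ID space. The function names, the toy codebook, and the vocabulary size of 32000 are all hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(queries: List[Sequence[float]],
             codebook: List[Sequence[float]]) -> List[int]:
    """Map each query vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(q, codebook[k]))
        for q in queries
    ]


def to_vocab_ids(code_indices: List[int], text_vocab_size: int) -> List[int]:
    """Shift codebook indices past the text vocabulary (assumed layout)."""
    return [text_vocab_size + k for k in code_indices]


# Toy data: 3 codebook entries and 3 query vectors in 2 dimensions.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.8], [0.1, -0.1]]

codes = quantize(queries, codebook)
mol_ids = to_vocab_ids(codes, text_vocab_size=32000)  # hypothetical size
print(codes, mol_ids)
```

With this layout, a molecule becomes a short sequence of integer IDs that an LLM with an enlarged embedding table can read and emit in the same way as text tokens.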

Paper Structure

This paper contains 49 sections, 7 equations, 4 figures, 20 tables.

Figures (4)

  • Figure 1: Comparisons among different molecular LLMs. The projection-based and Q-Former-based designs are adapter-based architectures that do not treat molecule and text modalities equally and lack a supervision signal for the molecule modality. The tokenizer-based design is our proposed architecture, where molecules are presented in the same discrete token representation as text.
  • Figure 2: Illustration of our proposed molecule tokenizer. The tokenizer generates discrete molecule tokens, which can be fed into LLMs for downstream tasks. The generated molecule tokens can be decoded into molecules using the adapter and the SMILES decoder during inference.
  • Figure 3: Illustration of the multi-modal autoregressive pretraining on molecule-text datasets. UniMoT excels in multi-modal comprehension and generation tasks, enabled by the unified LM objective. $T$ represents the size of the text vocabulary.
  • Figure 4: Illustration of our proposed Causal Q-Former. The Causal Q-Former provides causal queries for subsequent blocks.
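
The unified LM objective mentioned in the Figure 3 caption can be illustrated with a small sketch. It assumes the model predicts the next token over a single vocabulary of size T + K, where T is the text vocabulary size (as in the caption) and K is a hypothetical molecule codebook size, and that the loss is the ordinary average next-token negative log-likelihood. The toy sizes, helper names, and uniform logits are illustrative assumptions, not values from the paper.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def lm_loss(logits_per_step: List[List[float]],
            target_ids: List[int]) -> float:
    """Average next-token negative log-likelihood."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


T, K = 4, 3  # hypothetical toy sizes: text vocabulary and molecule codebook

# Two text tokens followed by two molecule tokens (IDs offset by T).
sequence = [1, 3, T + 0, T + 2]
targets = sequence[1:]

# Stand-in for model outputs: uniform logits over the unified vocabulary.
uniform_logits = [[0.0] * (T + K) for _ in targets]

loss = lm_loss(uniform_logits, targets)
print(loss)
```

Because text and molecule targets are scored by the same loss over the same output layer, the molecule modality receives a direct supervision signal, which is the property the adapter-based designs in Figure 1 are described as lacking.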