Exploring Hierarchical Molecular Graph Representation in Multimodal LLMs

Chengxin Hu, Hao Li, Yihe Yuan, Jing Li, Ivor Tsang

TL;DR

This work tackles how hierarchical molecular graph representations can be effectively integrated into multimodal LLMs. It introduces M$^3$LLM, a two-stage training framework with a hierarchical graph encoder that provides atom-, motif-, and graph-level features via BRICS segmentation and a virtual graph node, combined with self-supervised and contrastive learning to align graph features with text. The experiments show that preserving multi-level graph information is not always necessary; different tasks benefit from different levels, and a dynamic projector to fuse levels is key for maximizing performance. The findings highlight the need for improved graph-text alignment and adaptive fusion strategies, offering a path toward more capable multimodal LLMs for chemistry.

Abstract

Following the milestones in large language models (LLMs) and multimodal models, we have seen a surge in applying LLMs to biochemical tasks. Leveraging graph features and molecular text representations, LLMs can tackle various tasks, such as predicting chemical reaction outcomes and describing molecular properties. However, most current work overlooks the *multi-level nature* of the graph modality, even though different chemistry tasks may benefit from different feature levels. In this work, we first study the effect of feature granularity and reveal that even reducing all GNN-generated feature tokens to a single one does not significantly impact model performance. We then investigate the effect of various graph feature levels and demonstrate that both the quality of LLM-generated molecules and model performance across different tasks depend on different graph feature levels. We therefore conclude with two key insights: (1) current molecule-related multimodal LLMs lack a comprehensive understanding of graph features, and (2) static processing is not sufficient for hierarchical graph features. We share our findings in detail, in the hope of paving the way for the community to develop more advanced multimodal LLMs that incorporate molecular graphs.
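
To make the notions of feature levels and token reduction concrete, the following is a minimal sketch and not the authors' implementation. It assumes atom features are plain lists of floats, that a motif assignment (for example, one derived from a BRICS-style fragmentation) is already given as one integer id per atom, and that mean pooling stands in both for the virtual graph node and for the reduction of all tokens to a single one. The names `mean_pool`, `hierarchical_tokens`, and `fuse_levels`, as well as the softmax gate, are hypothetical illustrations of an adaptive fusion step.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def hierarchical_tokens(
    atom_feats: Sequence[Vector], motif_assignment: Sequence[int]
) -> Dict[str, List[Vector]]:
    """Build atom-, motif-, and graph-level tokens from atom features.

    Assumption: motif_assignment[i] is the motif id of atom i, and mean
    pooling is used as a stand-in for a learned motif/graph readout.
    """
    groups: Dict[int, List[Vector]] = {}
    for feat, motif_id in zip(atom_feats, motif_assignment):
        groups.setdefault(motif_id, []).append(list(feat))
    motif_tokens = [mean_pool(groups[k]) for k in sorted(groups)]
    graph_token = mean_pool(atom_feats)  # a single token for the whole graph
    return {
        "atom": [list(f) for f in atom_feats],
        "motif": motif_tokens,
        "graph": [graph_token],
    }


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_levels(levels: Dict[str, List[Vector]], gate_scores: Sequence[float]) -> Vector:
    """Fuse one summary vector per level with softmax gate weights.

    Assumption: gate_scores would come from a learned, input-dependent
    module; here they are passed in by hand, ordered atom, motif, graph.
    """
    summaries = [mean_pool(levels[name]) for name in ("atom", "motif", "graph")]
    weights = softmax(gate_scores)
    dim = len(summaries[0])
    return [sum(w * s[d] for w, s in zip(weights, summaries)) for d in range(dim)]


# Toy molecule: three atoms with 2-dimensional features, two motifs.
atom_feats = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
motif_assignment = [0, 0, 1]

levels = hierarchical_tokens(atom_feats, motif_assignment)
fused = fuse_levels(levels, gate_scores=[0.0, 0.0, 0.0])  # equal weighting
```

With a static projector the gate scores would be fixed for every input; an adaptive variant would compute them per molecule or per task, which is the kind of dynamic fusion the second insight points toward.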

Paper Structure

This paper contains 38 sections, 12 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Overview of M$^3$LLM Architecture. Left: A pipeline comprising GNN-based hierarchical graph dynamic segmentation, multi-level self-supervised learning, and multi-level contrastive learning. Right: A two-stage training pipeline of M$^3$LLM.