Chem4DLLM: 4D Multimodal LLMs for Chemical Dynamics Understanding

Xinyu Li, Zhen Zhang, Qi Chen, Anton van den Hengel, Lina Yao, Javen Qinfeng Shi

Abstract

Existing chemical understanding tasks primarily rely on static molecular representations, limiting their ability to model inherently dynamic phenomena such as bond breaking or conformational changes, which are essential for a chemist to understand chemical reactions. To address this gap, we introduce Chemical Dynamics Understanding (ChemDU), a new task that translates 4D molecular trajectories into interpretable natural-language explanations. ChemDU focuses on fundamental dynamic scenarios, including gas-phase and catalytic reactions, and requires models to reason about key events along molecular trajectories, such as bond formation and dissociation, and to generate coherent, mechanistically grounded narratives. To benchmark this capability, we construct Chem4DBench, the first dataset pairing 4D molecular trajectories with expert-authored explanations across these settings. We further propose Chem4DLLM, a unified model that integrates an equivariant graph encoder with a pretrained large language model to explicitly capture molecular geometry and rotational dynamics. We hope that ChemDU, together with Chem4DBench and Chem4DLLM, will stimulate further research in dynamic chemical understanding and multimodal scientific reasoning.

Paper Structure (26 sections, 5 equations, 7 figures, 3 tables)

This paper contains 26 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Comparison between static 3D molecular captioning and our proposed 4D molecular understanding task. (Top) Previous 3D methods utilize a static point cloud input $N \times 3$ to identify the molecule, such as Cyclohex-2-enone. (Bottom) Our 4D approach processes a temporal sequence of point clouds $T \times N \times 3$, enabling the model to describe dynamic chemical events. In this example, the model correctly identifies the trajectory of a C-O bond breaking within the Cyclohex-2-enone molecule, noting that the process initiates at $t=3$ and completes by $t=5$.
  • Figure 2: Overview of the Chem4D benchmark. The benchmark is a suite that encompasses three distinct categories: (1) Reaction Product Prediction, involving the analysis of bond breaking/forming events and reaction barriers (derived from Transition1x and RGD1); (2) Catalytic Reaction, covering complex surface interactions such as desorption (derived from OC20). For each category, the figure illustrates the workflow from the user query and 4D input, through the identification of key physical events, to the generation of the final scientific narrative.
  • Figure 3: The Chem4DLLM model architecture. (1) A 4D equivariant graph encoder (UMA) processes each 3D frame $\mathcal{X}_t$ into a graph embedding; (2) A projector transforms the graph embeddings into vectors that are additively fused with the embeddings of the corresponding special <graph> tokens; (3) The language model (Qwen3-8B) takes the resulting embedding sequence $\mathbf{E}$ as a prefix and autoregressively generates the output.
  • Figure 4: Statistical distribution of Reaction Product Prediction in the Chem4D benchmark. (a) The distribution of the number of atoms. (b) The distribution of the reaction barrier (eV). (c) The distribution of the reaction enthalpy (eV).
  • Figure 5: Statistical distribution of Catalytic Reaction Understanding in the Chem4D benchmark. (a) The distribution of the number of atoms. (b) The distribution of the reaction barrier (eV). (c) The distribution of the reaction enthalpy (eV).
  • ...and 2 more figures
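
The fusion step described in the Figure 3 caption (one graph embedding per frame, projected and added to the embedding of the matching <graph> token) can be illustrated with a small sketch. This is not the authors' implementation: plain Python lists stand in for tensors, a single linear map stands in for the projector, and the names `project`, `fuse_graph_tokens`, `GRAPH_TOKEN_ID`, and all numeric values are hypothetical, chosen only for illustration. The sketch also assumes one <graph> token per trajectory frame, which the caption suggests but does not state outright.

```python
# Hypothetical id of the special <graph> token (an assumption, not from the paper).
GRAPH_TOKEN_ID = 99


def project(vec, weight):
    """Linear projector stand-in: weight has shape (d_model, d_graph)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def fuse_graph_tokens(token_ids, token_embeds, frame_embeds, weight,
                      graph_token_id=GRAPH_TOKEN_ID):
    """Add each projected frame embedding to its <graph> token embedding.

    token_ids:    token ids of the prompt, length L
    token_embeds: L rows of size d_model (text embeddings)
    frame_embeds: T rows of size d_graph (one per trajectory frame)
    Returns a new L x d_model list; non-<graph> rows are unchanged.
    """
    positions = [i for i, tid in enumerate(token_ids) if tid == graph_token_id]
    if len(positions) != len(frame_embeds):
        raise ValueError("expected one <graph> token per trajectory frame")

    fused = [list(row) for row in token_embeds]  # copy, leave the input intact
    for pos, frame in zip(positions, frame_embeds):
        projected = project(frame, weight)
        fused[pos] = [a + b for a, b in zip(fused[pos], projected)]
    return fused


# Toy example: 4 prompt tokens, 2 of them <graph>, d_model = 2, d_graph = 3.
token_ids = [5, GRAPH_TOKEN_ID, GRAPH_TOKEN_ID, 7]
token_embeds = [[0.5, 0.5], [0.0, 0.0], [1.0, 1.0], [0.25, 0.25]]
frame_embeds = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # T = 2 frames
weight = [[1, 0, 0], [0, 1, 0]]  # keeps the first two graph features

fused = fuse_graph_tokens(token_ids, token_embeds, frame_embeds, weight)
print(fused)
```

In a real model the fused sequence would then be passed to the language model as its input embeddings. Additive fusion keeps the sequence length unchanged, so the positions of the text tokens are not shifted by the trajectory input.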