MV-CLAM: Multi-View Molecular Interpretation with Cross-Modal Projection via Language Model

Sumin Ha, Jun Hyeong Kim, Yinhua Piao, Sun Kim

TL;DR

MV-CLAM introduces a cross-modal framework that unifies multi-view molecular representations into a textual space for LLM-based understanding. It fuses 2D graph and 3D conformer encodings with an MQ-Former to produce universal query tokens that condition a language model for molecule captioning and retrieval. A novel multi-token contrasting loss preserves fine-grained molecular information across textual tokens, yielding improved retrieval and richer captions. On PubChem324k, MV-CLAM achieves state-of-the-art results, and ablations confirm the gains from multi-view fusion and token-level alignment, signaling a scalable path for enhanced molecular reasoning in biomedical contexts.

Abstract

Human expertise in chemistry and biomedicine relies on contextual molecular understanding, a capability that large language models (LLMs) can extend through fine-grained alignment between molecular structures and text. Recent multimodal learning advances focus on cross-modal alignment, but existing molecule-text models rely on single-view representations and ignore the complementary information in different molecular views, limiting molecular understanding. Moreover, naïve multi-view alignment strategies face two challenges: (1) separate aligned spaces with inconsistent mappings between molecule and text embeddings, and (2) existing loss objectives that fail to preserve complementary information for fine-grained alignment. Both can limit the LLM's ability to fully understand molecular properties. To address these issues, we propose MV-CLAM, a novel framework that aligns multi-view molecular representations into a unified textual space using a multi-query transformer (MQ-Former). Our approach ensures cross-view consistency, while a token-level contrastive loss preserves diverse molecular features across textual queries. MV-CLAM enhances molecular reasoning, improving retrieval and captioning accuracy. The source code of MV-CLAM is available at https://github.com/sumin124/mv-clam.git.

Paper Structure

This paper contains 36 sections, 8 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Motivations of MV-CLAM. (A) Complementary molecular information captured by 2D and 3D representations, where the 2D graph encodes edge connectivity and 3D conformers capture spatial coordinate structures. (B) Inconsistent mappings between molecule (2D and 3D) and property tokens (e.g., a 2D property token like solubility and 3D structural information like chiral 3-C) in distinct text spaces. (C) A unified alignment with a Multi-Querying Transformer (MQ-Former) allows all text tokens to share a single text space.
  • Figure 2: Methods for molecular language modeling. (A) Contrastive learning aligns two modalities via a contrastive objective, excelling in retrieval but lacking generative capabilities. (B) The Q-Former framework uses learnable query tokens for caption generation but is limited to a single molecular representation. (C) MV-CLAM extends this by integrating multiple representations with modality-specific queries, enabling fine-grained knowledge integration.
  • Figure 3: Training scheme of MQ-Former. The proposed MQ-Former enhances molecular language modeling by adding multi-token contrasting and amplified molecule captioning losses to the prior multi-objective loss (li2023blip; li2024towards; liu2023molca). (1) The novel multi-token contrasting loss $\ell_{MTC}$ replaces conventional molecule-text contrastive learning, encouraging diverse query-token alignment. (2) The molecule captioning loss $\ell_{MCap}$ is amplified to improve text generation quality. The molecule-text matching loss $\ell_{MTM}$ remains unchanged.
  • Figure 4: Molecule-text similarity for query-token contrasting. (A) The previous approach computes coarse-level similarity between the molecule queries and the CLS text token. (B) We propose a new approach that computes token-level similarity between the molecule queries and all text tokens, which preserves the diverse information carried by the molecule queries.
  • Figure 5: Comparison of Uni-modal Q-Former Ablation and Ours
  • ...and 6 more figures
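
The contrast drawn in the Figure 4 caption, between CLS-level and token-level molecule-text similarity, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the assumption that the CLS token sits at index 0, and the aggregation rules (maximum over queries for the coarse score; maximum over text tokens followed by a mean over queries for the token-level score) are illustrative assumptions, chosen only to show how the two scores differ.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def cls_similarity(queries, text_tokens):
    """Coarse score: compare every molecule query with the CLS text token only.

    queries:     (Q, D) molecule query embeddings
    text_tokens: (T, D) text token embeddings; index 0 is ASSUMED to be CLS
    Aggregation (max over queries) is an assumption of this sketch.
    """
    q = l2_normalize(queries)
    cls = l2_normalize(text_tokens[0])
    return float((q @ cls).max())


def token_level_similarity(queries, text_tokens, mask):
    """Fine-grained score: compare every molecule query with every text token.

    mask: (T,) boolean array, False for padding tokens that must be ignored.
    Aggregation (max over tokens, then mean over queries) is an assumption
    of this sketch, not a detail confirmed by the caption.
    """
    q = l2_normalize(queries)
    t = l2_normalize(text_tokens)
    sim = q @ t.T                                # (Q, T) cosine similarities
    sim = np.where(mask[None, :], sim, -np.inf)  # exclude padded tokens
    return float(sim.max(axis=1).mean())         # each query keeps its best token


# Toy example: two orthogonal queries, three text tokens (the first is CLS).
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
text_tokens = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
mask = np.array([True, True, True])

coarse = cls_similarity(queries, text_tokens)
fine = token_level_similarity(queries, text_tokens, mask)
```

In the coarse score only the query closest to the CLS token contributes, so the second query has no influence on the result. In the token-level score every query must find a matching text token, which is one way to read the caption's claim that the diverse information in the queries is preserved.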