MV-CLAM: Multi-View Molecular Interpretation with Cross-Modal Projection via Language Model

Sumin Ha; Jun Hyeong Kim; Yinhua Piao; Sun Kim

TL;DR

MV-CLAM introduces a cross-modal framework that unifies multi-view molecular representations in a single textual space for LLM-based understanding. It fuses 2D graph and 3D conformer encodings with a multi-query transformer (MQ-Former) to produce universal query tokens that condition a language model for molecule captioning and retrieval. A novel multi-token contrastive loss preserves fine-grained molecular information across textual tokens, improving retrieval and yielding richer captions. On PubChem324k, MV-CLAM achieves state-of-the-art results, and ablations confirm the gains from multi-view fusion and token-level alignment, pointing to a scalable path toward stronger molecular reasoning in biomedical contexts.

Abstract

Human expertise in chemistry and biomedicine relies on contextual molecular understanding, a capability that large language models (LLMs) can extend through fine-grained alignment between molecular structures and text. Recent advances in multimodal learning focus on cross-modal alignment, but existing molecule-text models rely on single-view representations and ignore the complementary information in different molecular views, which limits molecular understanding. Moreover, naïve multi-view alignment strategies face two challenges: (1) each view is aligned in a separate space, producing inconsistent mappings between molecule and text embeddings, and (2) existing loss objectives fail to preserve complementary information for fine-grained alignment. Both limit the LLM's ability to fully understand molecular properties. To address these issues, we propose MV-CLAM, a novel framework that aligns multi-view molecular representations into a unified textual space using a multi-query transformer (MQ-Former). Our approach ensures cross-view consistency, while a token-level contrastive loss preserves diverse molecular features across textual queries. MV-CLAM enhances molecular reasoning, improving retrieval and captioning accuracy. The source code of MV-CLAM is available at https://github.com/sumin124/mv-clam.git.
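
The abstract names a token-level contrastive loss but does not give its formulation. As a rough illustration of the general idea only, the sketch below applies a symmetric InfoNCE objective separately to each query token and averages the result, so that every token is pushed to align with its own text counterpart. The function name, the one-to-one pairing of molecule and text tokens, and the temperature value are assumptions made for this example and are not taken from the MV-CLAM implementation.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross_entropy_diag(sim):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim)


def token_level_contrastive_loss(mol_tokens, text_tokens, temperature=0.1):
    """Hypothetical token-wise symmetric InfoNCE loss.

    mol_tokens, text_tokens: nested lists shaped [batch][num_tokens][dim].
    Assumption: token k of a molecule is paired with token k of its text.
    """
    batch = len(mol_tokens)
    num_tokens = len(mol_tokens[0])
    loss = 0.0
    for k in range(num_tokens):
        mol_k = [_normalize(mol_tokens[i][k]) for i in range(batch)]
        txt_k = [_normalize(text_tokens[i][k]) for i in range(batch)]
        # Cosine similarity of every molecule with every text, for token k.
        sim = [[_dot(m, t) / temperature for t in txt_k] for m in mol_k]
        sim_t = [list(col) for col in zip(*sim)]
        # Average the molecule-to-text and text-to-molecule directions.
        loss += 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))
    return loss / num_tokens
```

Under these assumptions, a batch whose molecule and text tokens match pairwise gives a loss close to zero, while swapping the texts between two molecules gives a much larger value. Computing the loss per token, rather than on a single pooled embedding, is one simple way to keep different tokens from collapsing onto the same information.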
