ChemVTS-Bench: Evaluating Visual-Textual-Symbolic Reasoning of Multimodal Large Language Models in Chemistry

Zhiyuan Huang, Baichuan Yang, Zikun He, Yanhong Wu, Fang Hongyu, Zhenhe Liu, Lin Dongsheng, Bing Su

TL;DR

ChemVTS-Bench introduces a domain-authentic benchmark to rigorously evaluate Visual-Textual-Symbolic reasoning in chemistry across three input modalities (image, image–text, SMILES). By covering organic, inorganic, and 3D crystal structures and employing an automated two-stage agent-based evaluation, it enables fine-grained analysis of modality-dependent reasoning and cross-modal integration. Key findings show that structural chemistry is the hardest domain, visual grounding remains a bottleneck for open-source models, and multimodal fusion reduces but does not eliminate errors, underscoring the need for domain-faithful evaluation. The work provides a scalable, open data/code framework to drive future progress in multimodal chemical reasoning.

Abstract

Chemical reasoning inherently integrates visual, textual, and symbolic modalities, yet existing benchmarks rarely capture this complexity, often relying on simple image-text pairs with limited chemical semantics. As a result, the actual ability of Multimodal Large Language Models (MLLMs) to process and integrate chemically meaningful information across modalities remains unclear. We introduce ChemVTS-Bench, a domain-authentic benchmark designed to systematically evaluate the Visual-Textual-Symbolic (VTS) reasoning abilities of MLLMs. ChemVTS-Bench contains diverse and challenging chemical problems spanning organic molecules, inorganic materials, and 3D crystal structures, with each task presented in three complementary input modes: (1) visual-only, (2) visual-text hybrid, and (3) SMILES-based symbolic input. This design enables fine-grained analysis of modality-dependent reasoning behaviors and cross-modal integration. To ensure rigorous and reproducible evaluation, we further develop an automated agent-based workflow that standardizes inference, verifies answers, and diagnoses failure modes. Extensive experiments on state-of-the-art MLLMs reveal that visual-only inputs remain challenging, structural chemistry is the hardest domain, and multimodal fusion mitigates but does not eliminate visual, knowledge-based, or logical errors, highlighting ChemVTS-Bench as a rigorous, domain-faithful testbed for advancing multimodal chemical reasoning. All data and code will be released to support future research.
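
To make the three input modes concrete, the following is a minimal sketch of how one benchmark task might be expanded into visual-only, visual-text, and SMILES-based inputs. It is not the released ChemVTS-Bench code: the `Task` fields, the `build_input` helper, the mode names, and the payload layout are all hypothetical, and the sketch assumes that a visual-only item carries the whole problem inside the image and that some items (for example crystal structures) may have no SMILES string.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mode names for the three input modes named in the abstract.
MODES = ("visual", "visual_text", "smiles")


@dataclass(frozen=True)
class Task:
    """One benchmark problem (hypothetical schema, not the official one)."""
    question: str              # problem statement as plain text
    image_path: Optional[str]  # rendered structure or problem image
    smiles: Optional[str]      # symbolic form; assumed absent for some items
    reference_answer: str


def build_input(task: Task, mode: str) -> dict:
    """Return the model payload for one task under one input mode."""
    if mode == "visual":
        # Assumption: the image alone carries the problem, so no text is sent.
        return {"images": [task.image_path], "text": ""}
    if mode == "visual_text":
        # The image is paired with the textual problem statement.
        return {"images": [task.image_path], "text": task.question}
    if mode == "smiles":
        if task.smiles is None:
            raise ValueError("task has no SMILES representation")
        # The structure is given symbolically, with no image at all.
        return {"images": [], "text": f"{task.question}\nSMILES: {task.smiles}"}
    raise ValueError(f"unknown mode: {mode}")
```

Keeping the same task and reference answer across all modes is what allows accuracy differences to be attributed to the input modality rather than to the problem itself.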

Paper Structure

This paper contains 27 sections, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Comparison of model accuracy on ChemVTS-Bench for instances that provide all three input modalities (Text, Visual, and Visual-Text) simultaneously.
  • Figure 2: Left: an example from MMCR; Right: an example from ChemVTS. As shown, MMCR contains image prompts with weak chemical semantics, often resembling simple “describe-the-picture” tasks, which may overestimate a model’s cross-modal reasoning ability. Moreover, MMCR provides only a visual–text input mode. In contrast, ChemVTS constructs text-only tasks (via OCR extraction and SMILES reconstruction) and visual–text tasks from the original visual-only data, enabling a more comprehensive and fine-grained multimodal evaluation.
  • Figure 3: Representative visual examples from ChemVTS-Bench, highlighting the diverse and challenging chemical structures present in the benchmark.
  • Figure 4: Our evaluation pipeline. Visual, text, and SMILES inputs are used as input tokens for the problem. The System and User prompts follow a unified input format with slight variations. The answers generated by the MLLM are then processed through a two-stage agent workflow: the first stage produces the evaluation outcomes and the second produces the error analysis.
  • Figure 5: This figure depicts a specific example of the pipeline for MLLM-generated solution evaluation and Agent-based error diagnosis. The process begins by inputting a ground-truth problem into the MLLM to obtain its solution. Subsequently, the Agent receives the original problem, the reference solution, and the MLLM's solution as inputs, finally producing a diagnosis of the error type.
  • ...and 10 more figures
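
The two-stage workflow described in Figures 4 and 5 can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the record keys, the `evaluate` function, and the `exact_match_verify` stand-in are hypothetical, and in the paper both stages are performed by an agent rather than by string matching. The three error labels follow the visual, knowledge-based, and logical categories named in the abstract; the `other` bucket is an added assumption for unrecognized labels.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Error categories named in the abstract; "other" is an assumed fallback.
ERROR_TYPES = ("visual", "knowledge", "logical")

Verifier = Callable[[str, str, str], bool]   # (problem, reference, prediction)
Diagnoser = Callable[[str, str, str], str]   # returns an error label


def exact_match_verify(problem: str, reference: str, prediction: str) -> bool:
    """Stand-in for the stage-1 agent: case-insensitive exact match."""
    return reference.strip().lower() == prediction.strip().lower()


def evaluate(
    records: Iterable[dict],
    verify: Verifier,
    diagnose: Diagnoser,
) -> Tuple[Dict[str, float], Dict[str, int]]:
    """Stage 1 scores each answer; stage 2 labels only the wrong ones."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    errors: Counter = Counter()
    for rec in records:
        mode = rec["mode"]
        total[mode] += 1
        args = (rec["problem"], rec["reference"], rec["prediction"])
        if verify(*args):
            correct[mode] += 1
            continue
        label = diagnose(*args)
        errors[label if label in ERROR_TYPES else "other"] += 1
    accuracy = {mode: correct[mode] / total[mode] for mode in total}
    return accuracy, dict(errors)
```

Passing the verifier and diagnoser as callables keeps the bookkeeping separate from the judging step, so an LLM-backed agent could replace the stand-ins without changing how per-mode accuracy and error counts are accumulated.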