ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025

Qiang Xu, Shengyuan Bai, Leqing Chen, Zijing Liu, Yu Li

TL;DR

ChemO is a chemistry-specific multimodal benchmark derived from IChO 2025. It uses Assessment-Equivalent Reformulation (AER) to convert tasks that require visual outputs into machine-readable formats. The authors also propose ChemLabs, a hierarchical multi-agent system organized into Perception, Solving, and Audit labs, which supports dynamic task decomposition, structured visual guidance, and dual-stage verification. Across multiple frontier MLLMs, combining Structured Visual Enhancement (SVE) with ChemLabs yields substantial performance gains: Gemini-2.5 Pro reaches 93.6/100, surpassing an estimated human gold-medal threshold. The work advances automated evaluation of complex chemical reasoning and sets a new state of the art for multimodal, olympiad-style chemistry problem solving, with implications for diagnostic tooling and scalable assessment.

Abstract

Olympiad-level benchmarks in mathematics and physics are crucial testbeds for advanced AI reasoning, but chemistry, with its unique multimodal symbolic language, has remained an open challenge. We introduce ChemO, a new benchmark built from the International Chemistry Olympiad (IChO) 2025. ChemO features two key innovations for automated assessment: Assessment-Equivalent Reformulation (AER), which converts problems requiring visual outputs (e.g., drawing molecules) into computationally tractable formats, and Structured Visual Enhancement (SVE), a diagnostic mechanism to disentangle a model's visual perception capabilities from its core chemical reasoning. To tackle this benchmark, we propose ChemLabs, a hierarchical multi-agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. Experiments on state-of-the-art multimodal models demonstrate that combining SVE with our multi-agent system yields dramatic performance gains. Our top configuration achieves a score of 93.6 out of 100, surpassing an estimated human gold medal threshold and establishing a new state-of-the-art in automated chemical problem-solving. ChemO Dataset: https://huggingface.co/datasets/IDEA-AI4SCI/ChemO

Paper Structure

This paper contains 34 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of ChemLabs, a hierarchical multi-agent framework for solving IChO problems. Each complete question is first received by a manager agent, which autonomously decomposes it into sub-tasks (e.g., 1.1, 1.2, 1.3) and dispatches them to domain-specific solvers according to their types. Visual sub-tasks are processed through the Perception Lab for structured interpretation, followed by task-specific reasoning in the Solving Lab. The resulting answers are refined by the introspector and verified in the Audit Lab via Chem-Auditor and General-Auditor. This design enables adaptive task allocation, modular reasoning, and interpretable multi-agent collaboration across diverse chemical problem types.
  • Figure 2: Example reaction scheme used to construct the structured visual guidance $\mathcal{G}$. The scheme, taken from Problem 1 of the IChO 2025 theoretical exam, depicts the transformations of i-Cy into B, A, and C under different reaction conditions.
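
The control flow described in the Figure 1 caption can be sketched as follows. This is a minimal illustration only, not the authors' implementation: every function name, the `kind` labels ("structure", "calculation"), and the stubbed return values are assumptions made for the example, and each stage that would call a multimodal model in a real system is replaced by a placeholder that records which stage ran.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SubTask:
    task_id: str                 # e.g. "1.1"
    kind: str                    # hypothetical routing label, not from the paper
    text: str
    has_image: bool = False
    trace: List[str] = field(default_factory=list)  # stages visited, in order


def manager(question_id: str, parts: List[Tuple[str, str, bool]]) -> List[SubTask]:
    """Decompose a question into numbered sub-tasks (stub: parts are supplied)."""
    return [
        SubTask(f"{question_id}.{i}", kind, text, has_image)
        for i, (kind, text, has_image) in enumerate(parts, start=1)
    ]


def perception_lab(task: SubTask) -> str:
    """Stub for structured interpretation of a visual sub-task."""
    task.trace.append("perception")
    return f"[structured description of the figure in {task.task_id}]"


def make_solver(kind: str) -> Callable[[SubTask, str], str]:
    """Build a placeholder domain-specific solver for the given task type."""
    def solve(task: SubTask, context: str) -> str:
        task.trace.append(f"solve:{kind}")
        return f"draft answer to {task.task_id}"
    return solve


# Solving Lab: one solver per (assumed) task type.
SOLVERS: Dict[str, Callable[[SubTask, str], str]] = {
    kind: make_solver(kind) for kind in ("structure", "calculation")
}


def introspector(task: SubTask, draft: str) -> str:
    """Stub refinement step applied before auditing."""
    task.trace.append("introspect")
    return draft.replace("draft", "refined")


def chem_auditor(task: SubTask, answer: str) -> bool:
    task.trace.append("chem_audit")
    return bool(answer.strip())      # placeholder check only


def general_auditor(task: SubTask, answer: str) -> bool:
    task.trace.append("general_audit")
    return bool(answer.strip())      # placeholder check only


def run(question_id: str, parts: List[Tuple[str, str, bool]]):
    """Route each sub-task through perception (if visual), solving, and audit."""
    results = {}
    for task in manager(question_id, parts):
        context = perception_lab(task) if task.has_image else ""
        draft = SOLVERS[task.kind](task, context)
        refined = introspector(task, draft)
        checks = [chem_auditor(task, refined), general_auditor(task, refined)]
        results[task.task_id] = (refined, all(checks), task.trace)
    return results


results = run("1", [("structure", "Identify B", True),
                    ("calculation", "Compute the yield", False)])
for task_id, (answer, passed, trace) in results.items():
    print(task_id, passed, " -> ".join(trace))
```

Only the visual sub-task passes through the perception stub, while both sub-tasks go through solving, introspection, and the two audit checks, mirroring the order given in the caption. How the actual system decides task types, and what the auditors verify, is not specified in this excerpt.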