Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Fei Lin, Ziyang Gong, Cong Wang, Tengchao Zhang, Yonglin Tian, Yining Jiang, Ji Dai, Chao Guo, Xiaotong Yu, Xue Yang, Gen Luo, Fei-Yue Wang

TL;DR

This work introduces ToxiMol, the first benchmark for assessing general-purpose multimodal LLMs on molecular toxicity repair, paired with the ToxiEval multi-criteria evaluation framework. The benchmark uses a dataset of 660 toxic molecules across 11 tasks and requires models to generate structurally valid, less-toxic substitutes while satisfying drug-likeness and synthetic feasibility constraints. A mechanism-aware prompting pipeline and an automated evaluation chain enable standardized, objective comparison across 43 MLLMs, revealing that current models struggle with multi-endpoint toxicity repair but show emerging capabilities in toxicity understanding and structure-aware editing. The findings highlight both the need for more robust multimodal alignment and the potential of domain-specific pretraining to bridge the gap toward practical molecular detoxification workflows.

Abstract

Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair, generating structurally valid molecular alternatives with reduced toxicity, has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 660 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess 43 mainstream general-purpose MLLMs and conduct multiple ablation studies to analyze key issues, including evaluation metrics, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware editing.
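The evaluation chain described above can be pictured as a sequence of gates that each candidate must pass in order. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Candidate` fields, the stage order after validity and safety, and the three threshold constants are all hypothetical placeholders, and the scores are assumed to be precomputed by external tools (a structure parser, a toxicity predictor, and property calculators).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One generated molecule with precomputed scores (all fields hypothetical)."""
    name: str
    is_valid: bool            # did the structure parse as a valid molecule?
    is_safe: bool = False     # toxicity predictor verdict for the task endpoint
    qed: float = 0.0          # drug-likeness score, higher is better
    sa_score: float = 10.0    # synthetic accessibility score, lower is better
    similarity: float = 0.0   # structural similarity to the input molecule


# Placeholder thresholds for illustration only; they are not taken from the paper.
MIN_QED = 0.5
MAX_SA_SCORE = 6.0
MIN_SIMILARITY = 0.4


def first_failed_stage(c: Candidate) -> Optional[str]:
    """Return the name of the first stage the candidate fails, or None if it passes all."""
    if not c.is_valid:
        return "validity"
    if not c.is_safe:
        return "safety"
    if c.qed < MIN_QED:
        return "drug-likeness"
    if c.sa_score > MAX_SA_SCORE:
        return "synthetic accessibility"
    if c.similarity < MIN_SIMILARITY:
        return "structural similarity"
    return None


def successful_repairs(candidates: List[Candidate]) -> List[str]:
    """Names of the candidates that pass every stage of the chain."""
    return [c.name for c in candidates if first_failed_stage(c) is None]


# Illustrative scores only: one candidate fails safety, one is invalid, one passes.
candidates = [
    Candidate("Candidate 1", is_valid=True, is_safe=False,
              qed=0.7, sa_score=3.0, similarity=0.6),
    Candidate("Candidate 2", is_valid=False),
    Candidate("Candidate 3", is_valid=True, is_safe=True,
              qed=0.6, sa_score=3.5, similarity=0.5),
]

for c in candidates:
    print(c.name, "->", first_failed_stage(c) or "passes all stages")
print("Successful repairs:", successful_repairs(candidates))
```

Returning the first failed stage, rather than a bare pass/fail flag, keeps the reason for rejection available, which is the kind of information a failure-attribution analysis would need.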

Paper Structure

This paper contains 43 sections, 2 equations, 11 figures, 15 tables, 2 algorithms.

Figures (11)

  • Figure 1: A radar plot of success rates (%) of representative MLLMs [openai_gpt52_2025, claude37sonnet2024, wang2025internvl3, bai2025qwen2] across 11 toxicity repair tasks. To enhance visual contrast, the two low-performing tasks are highlighted in a light blue area with axes scaled to 0--8%, while other tasks use a 0--80% range.
  • Figure 2: Statistics of toxic molecules and the number of clusters for each task. Values are plotted along the radius on a logarithmic scale ($10^{1}$--$10^{4}$); the dashed ring at $n=60$ indicates the balanced sampling size determined by the Carcinogens bottleneck constraint.
  • Figure 3: A repair example on the AMES task (binary classification, $\mathcal{S}_{\text{safe}} = 1$) using the ToxiMol benchmark. The input includes the SMILES of Coumarin 6H [pubchem94022], its 2D molecular image, and a task-adaptive prompt annotation pipeline. After MLLM-mediated toxicity repair, three candidate molecules are generated. Due to structural invalidity, the ToxiEval evaluation chain first filters out Candidate 2. Among the remaining molecules, Candidate 1 is excluded because it fails the safety threshold in the multi-criteria assessment. Candidate 3 passes all stages of the ToxiEval chain and is identified as a successful repair.
  • Figure 4: Distribution of validity and success rates (%) of representative MLLMs on the ToxiMol benchmark. Panel (a) shows the structural validity rates of each model, and panel (b) shows their repair success rates.
  • Figure 5: Effect of candidate counts $k \in [1, 9]$ on toxicity repair success rates. Panel (a) shows task-level success rates in a radar chart, with each color representing the number of candidates. Two low-performing tasks are highlighted in a light blue area with axes scaled to 0--8%, while others use a 0--80% range. Panel (b) shows overall success rates as a function of candidate number.
  • ...and 6 more figures