AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials

Taoyuze Lv, Alexander Chen, Fengyu Xie, Chu Wu, Jeffrey Meng, Dongzhan Zhou, Bram Hoex, Zhicheng Zhong, Tong Xie

TL;DR

AtomWorld introduces the first benchmark focused on CIF-based motor skills to evaluate LLM spatial reasoning in crystallography. Built as a scalable data generator, it pairs before/after CIFs with action prompts and uses StructureMatcher to quantify structural fidelity, enabling objective cross-model comparisons. Experiments reveal that frontier LLMs handle basic edits but struggle with multi-step spatial tasks like rotations, though tool-augmented and retrieval-based workflows yield measurable gains. The work highlights a gap between CIF syntax literacy and spatial reasoning and positions AtomWorld as a foundational stepping stone toward autonomous, agentic materials discovery workflows.
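To make the idea of "quantifying structural fidelity" concrete, the following is a minimal sketch of a distance-based score between a generated and a target structure. It is not the StructureMatcher algorithm from pymatgen, which also handles lattice reduction, scaling, and site permutation. The assumptions here are all illustrative: atoms are paired by index, coordinates are fractional in a unit cubic cell, and the names `max_frac_dist`, `is_success`, and the tolerance `tol` are hypothetical.

```python
from math import sqrt
from typing import List, Optional, Tuple

# (element symbol, fractional coordinates); a simplified stand-in for a CIF site
Atom = Tuple[str, Tuple[float, float, float]]


def wrapped_delta(a: float, b: float) -> float:
    """Shortest separation along one axis under periodic boundaries."""
    d = (a - b) % 1.0
    return min(d, 1.0 - d)


def max_frac_dist(generated: List[Atom], target: List[Atom]) -> Optional[float]:
    """Largest per-atom displacement, or None if the structures cannot be paired."""
    if len(generated) != len(target):
        return None
    worst = 0.0
    for (el_g, pos_g), (el_t, pos_t) in zip(generated, target):
        if el_g != el_t:
            return None  # species mismatch: treated as a failed match
        dist = sqrt(sum(wrapped_delta(g, t) ** 2 for g, t in zip(pos_g, pos_t)))
        worst = max(worst, dist)
    return worst


def is_success(generated: List[Atom], target: List[Atom], tol: float = 0.05) -> bool:
    """Hypothetical success criterion: every atom lies within tol of its target."""
    d = max_frac_dist(generated, target)
    return d is not None and d <= tol
```

A generated structure with a missing atom or a wrong element returns `None` and counts as a failure, while small coordinate drift is scored by its worst-case displacement.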

Abstract

Large Language Models (LLMs) excel at textual reasoning and are beginning to develop spatial understanding, prompting the question of whether these abilities can be combined for complex, domain-specific tasks. This question is essential in fields like materials science, where a deep understanding of 3D atomic structures is fundamental. While initial studies have successfully applied LLMs to tasks involving pure crystal generation or coordinate understanding, a standardized benchmark to systematically evaluate their core reasoning abilities across diverse atomic structures has been notably absent. To address this gap, we introduce the AtomWorld benchmark to evaluate LLMs on tasks based on Crystallographic Information Files (CIFs), a standard structure representation format. These tasks, including structural editing, CIF perception, and property-guided modeling, reveal a critical limitation: current models, despite establishing promising baselines, consistently fail in structural understanding and spatial reasoning. Our experiments show that these models make frequent errors on structure modification tasks, and even in basic CIF format understanding, potentially leading to cumulative errors in subsequent analysis and materials insights. By defining these standardized tasks, AtomWorld lays the groundwork for advancing LLMs toward robust atomic-scale modeling, crucial for accelerating materials research and automating scientific workflows.

Paper Structure

This paper contains 35 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: AtomWorld benchmark flowchart. The AtomWorld generator follows a structured data flow: the random sampler selects a structure from a predefined structure pool (in this work, a subset of CIF files from the Materials Project database (Jain et al., 2013)); the random initializer parametrizes the chosen action template by assigning atom indices and/or positions; the structure operator applies the instantiated action to the original structure to obtain the target structure; and the prompter generates a natural language description aligned with the action. The resulting (input structure, action prompt) pairs are then fed into the LLM agent system, whose generated structure is compared against the target structure using the StructureMatcher from pymatgen (Ong et al., 2013) to compute the desired evaluation metric.
  • Figure 2: a. Success rate metric across AtomWorld, CIF-Repair, CIF-Gen and StructProp datasets. b. Mean max_dist metric across AtomWorld and CIF-Gen datasets. c, d. Parameter scaling results on Qwen3 series.
  • Figure 3: The number of correctly generated CIFs for each structure type in the CIF-Gen task. The squares marked in red indicate cases where the single correct generation is the standard prototype. The right side shows the specific 3D crystal structures for each type, where the chemical compositions in red represent the standard prototypes.
  • Figure 4: The workflow of a specific insert_between task.
  • Figure 5: The flowchart for the code generation-based approach for the AtomWorld benchmark tests.
  • ...and 1 more figure
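
The generator data flow described in the Figure 1 caption (sampler, initializer, operator, prompter) can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the two-entry `STRUCTURE_POOL`, the "move" action, the `Sample` container, and the function name `make_move_sample` are all assumptions made for the example, and structures are plain lists of (element, fractional coordinate) pairs rather than parsed CIFs.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Atom = Tuple[str, Tuple[float, float, float]]

# Hypothetical stand-in for the structure pool (the paper samples CIF files).
STRUCTURE_POOL: List[List[Atom]] = [
    [("Na", (0.0, 0.0, 0.0)), ("Cl", (0.5, 0.5, 0.5))],
    [("Cs", (0.0, 0.0, 0.0)), ("Cl", (0.5, 0.5, 0.5))],
]


@dataclass
class Sample:
    prompt: str                    # natural language description of the action
    input_structure: List[Atom]    # structure shown to the model
    target_structure: List[Atom]   # ground truth used for scoring


def make_move_sample(rng: random.Random) -> Sample:
    # Random sampler: pick a structure from the pool.
    structure = rng.choice(STRUCTURE_POOL)

    # Random initializer: choose the atom index and the shift for the action.
    index = rng.randrange(len(structure))
    shift = tuple(round(rng.uniform(-0.2, 0.2), 2) for _ in range(3))

    # Structure operator: apply the action, wrapping back into the unit cell.
    element, pos = structure[index]
    new_pos = tuple(round((p + s) % 1.0, 4) for p, s in zip(pos, shift))
    target = list(structure)
    target[index] = (element, new_pos)

    # Prompter: describe the same action in natural language.
    prompt = f"Move atom {index} ({element}) by {shift} in fractional coordinates."
    return Sample(prompt, list(structure), target)
```

Because the target is produced by applying the action programmatically, each (input structure, prompt) pair comes with a ground-truth answer, and seeding the random generator makes a sampled test set reproducible.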