OPENXRD: A Comprehensive Benchmark Framework for LLM/MLLM XRD Question Answering

Ali Vosoughi, Ayoub Shahnazari, Yufeng Xi, Zeliang Zhang, Griffin Hess, Chenliang Xu, Niaz Abdolrahim

TL;DR

Results show that mid-sized models (7B to 70B parameters) gain the most from contextual materials, with the largest relative gains appearing in small and mid-sized models, while very large models often show saturation or interference.

Abstract

We introduce OPENXRD, a comprehensive benchmarking framework for evaluating large language models (LLMs) and multimodal LLMs (MLLMs) in crystallography question answering. The framework measures context assimilation, that is, how models use fixed, domain-specific supporting information during inference. It includes 217 expert-curated X-ray diffraction (XRD) questions covering fundamental to advanced crystallographic concepts, each evaluated under closed-book (without context) and open-book (with context) conditions, where the latter includes concise reference passages generated by GPT-4.5 and refined by crystallography experts. We benchmark 74 state-of-the-art LLMs and MLLMs, including the GPT-4, GPT-5, O-series, LLaVA, LLaMA, QWEN, Mistral, and Gemini families, to quantify how different architectures and scales assimilate external knowledge. Results show that mid-sized models (7B to 70B parameters) gain the most from contextual materials, with the largest relative gains appearing in small and mid-sized models, while very large models often show saturation or interference. Expert-reviewed materials provide significantly larger improvements than AI-generated ones even when token counts are matched, indicating that content quality, not quantity, drives performance. OPENXRD offers a reproducible diagnostic benchmark for assessing reasoning, knowledge integration, and guidance sensitivity in scientific domains, and provides a foundation for future multimodal and retrieval-augmented crystallography systems.
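
The closed-book versus open-book comparison described in the abstract can be illustrated with a minimal sketch. It assumes per-question correctness flags are already available for both conditions; the `Result` record, the `context_gain` helper, and the toy data are hypothetical and are not taken from the OPENXRD code or results.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Result:
    """Hypothetical per-question record; not the paper's data format."""
    question_id: str
    correct_closed: bool  # answered correctly without supporting material
    correct_open: bool    # answered correctly with supporting material


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def context_gain(results: List[Result]) -> dict:
    """Closed-book accuracy, open-book accuracy, and their difference."""
    closed = accuracy(r.correct_closed for r in results)
    opened = accuracy(r.correct_open for r in results)
    return {"closed": closed, "open": opened, "gain": opened - closed}


# Toy data, invented for illustration only.
toy_results = [
    Result("q1", False, True),   # context helped
    Result("q2", True, True),    # already known
    Result("q3", False, False),  # context did not help
    Result("q4", True, False),   # context interfered
    Result("q5", False, True),   # context helped
]

summary = context_gain(toy_results)
print(summary)
```

A positive `gain` suggests the model assimilated the supporting material, a value near zero is consistent with saturation, and a negative value is consistent with the interference the abstract mentions for very large models.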

Paper Structure

This paper contains 31 sections, 5 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Example question from the OPENXRD dataset in closed-book format. Each question includes a concise prompt and 3-4 multiple-choice options, exactly one of which is correct. This illustrates the baseline evaluation condition where models must rely solely on their internal knowledge without external supporting materials.
  • Figure 2: A word cloud of the subtask labels from our human-curated crystallography dataset. Larger words indicate subtasks with a higher number of questions, illustrating the breadth of topics (e.g., diffraction fundamentals, lattice geometry, advanced structure analysis). This distribution underscores the diversity and domain complexity captured in our QA benchmark, spanning both basic definitions (e.g., counting crystal systems) and intricate reflections (e.g., twin boundaries, space-group anomalies).
  • Figure 3: The expert-reviewed version provides essential clarifications about the angular dependence of atomic scattering factors, emphasizes their direct relationship to atomic electron count, and highlights practical implications for interpreting diffraction data. These elements are often missed in AI-generated explanations.
  • Figure 4: An open-book mode example with expert-reviewed supporting material that clearly explains how a path difference of $n\lambda$ leads to constructive interference, with additional contextual information about its relevance to crystallography. The correct answer is highlighted in green.
  • Figure 5: Illustration of our open-book QA pipeline for crystallography. In closed-book mode, the model sees only the question (left rotated box). In open-book mode, it also receives domain-specific supporting textual material (right rotated box), which is concatenated and fed to the LLM (center pipeline), producing the final QA result.
  • ...and 6 more figures
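
The open-book pipeline described for Figure 5, where supporting material is concatenated with the question before it reaches the model, might be sketched as follows. The prompt wording, the `build_prompt` function, and the example question are assumptions made for illustration; the paper's actual prompt template is not shown in this section.

```python
from typing import List, Optional


def build_prompt(question: str, options: List[str],
                 context: Optional[str] = None) -> str:
    """Build a multiple-choice prompt.

    With context=None this is the closed-book condition; with a passage
    it is the open-book condition (passage prepended to the question).
    """
    lines = []
    if context:
        lines.append("Supporting material:")
        lines.append(context.strip())
        lines.append("")
    lines.append(f"Question: {question}")
    # Label up to four options A-D, matching the 3-4 choices per question.
    for label, option in zip("ABCD", options):
        lines.append(f"{label}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


# Hypothetical example in the spirit of Figure 4 (not the dataset's wording).
question = "Which path difference gives constructive interference?"
options = ["An integer multiple of the wavelength",
           "A half-integer multiple of the wavelength",
           "Any path difference"]
passage = ("Scattered waves reinforce each other when their path "
           "difference equals n times the wavelength.")

closed_prompt = build_prompt(question, options)
open_prompt = build_prompt(question, options, context=passage)
print(open_prompt)
```

Keeping the question block identical in both conditions means any accuracy difference can be attributed to the supporting material rather than to a change in how the question is posed.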