MaterialsGalaxy: A Platform Fusing Experimental and Theoretical Data in Condensed Matter Physics

Tiannian Zhu, Zhong Fang, Quansheng Wu, Hongming Weng

TL;DR

MaterialsGalaxy presents a structure similarity-driven platform that bridges experimental and theoretical data in condensed matter physics by transforming crystal structures into fingerprints and indexing them for fast vector-based fusion. The system standardizes heterogeneous data, links records via near-real-time similarity searches, and enriches material profiles with direct and analog information, augmented by AI tools for knowledge extraction, structure prediction, and property forecasting. Key contributions include a robust data standardization pipeline, a scalable structure-driven fusion engine, and demonstrated utility through CrGeTe3 and additional materials, supported by a public API and FAIR-aligned data access. This work enables a data-driven materials discovery paradigm that accelerates hypothesis generation, synthesis guidance, and cross-modal insights by tightly integrating experiment, theory, and AI within a unified platform.

Abstract

Modern materials science generates vast and diverse datasets from both experiments and computations, yet these multi-source, heterogeneous data often remain disconnected in isolated "silos". Here, we introduce MaterialsGalaxy, a comprehensive platform that deeply fuses experimental and theoretical data in condensed matter physics. Its core innovation is a structure similarity-driven data fusion mechanism that quantitatively links cross-modal records - spanning diffraction, crystal growth, computations, and literature - based on their underlying atomic structures. The platform integrates artificial intelligence (AI) tools, including large language models (LLMs) for knowledge extraction, generative models for crystal structure prediction, and machine learning property predictors, to enhance data interpretation and accelerate materials discovery. We demonstrate that MaterialsGalaxy effectively integrates these disparate data sources, uncovering hidden correlations and guiding the design of novel materials. By bridging the long-standing gap between experiment and theory, MaterialsGalaxy provides a new paradigm for data-driven materials research and accelerates the discovery of advanced materials.
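The structure similarity-driven linking described above can be illustrated with a minimal sketch. It assumes each record already has a fixed-length fingerprint vector and that records are linked when their cosine similarity exceeds a threshold. The record IDs, the three-dimensional toy vectors, the threshold value, and the function names are all hypothetical; the platform's actual fingerprints come from a learned representation and are searched in a vector database, neither of which is reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length fingerprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no structural information
    return dot / (norm_a * norm_b)


def link_records(query_vec, index, threshold=0.9, top_k=5):
    """Return up to top_k (record_id, score) pairs at or above the threshold.

    `index` maps record IDs to fingerprint vectors. This is a brute-force
    scan; a real vector database would use an approximate index instead.
    """
    scored = [(rid, cosine_similarity(query_vec, vec)) for rid, vec in index.items()]
    hits = [(rid, score) for rid, score in scored if score >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:top_k]


# Toy index: hypothetical record IDs with made-up 3-D fingerprints.
index = {
    "exp:diffraction-001": [0.90, 0.10, 0.40],
    "calc:bands-042": [0.88, 0.12, 0.41],
    "exp:growth-007": [0.10, 0.95, 0.05],
}

links = link_records([0.90, 0.10, 0.40], index)
for record_id, score in links:
    print(f"{record_id}: {score:.4f}")
```

In this toy index the diffraction and band-structure records are linked because their vectors point in nearly the same direction, while the growth record falls below the threshold and is left out.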

Paper Structure

This paper contains 21 sections, 5 figures.

Figures (5)

  • Figure 1: Architecture of the MaterialsGalaxy platform. The platform employs a systematic workflow to fuse heterogeneous data from three primary channels: (1) existing public databases, (2) electronic laboratory notebooks, and (3) automated literature extraction. Raw data first undergo a rigorous standardization process. The core innovation is the structure vectorization module, which uses representation learning to generate a unique fingerprint for each crystal structure. These fingerprints are indexed in a high-performance vector database, enabling a similarity matching engine to dynamically link disparate records. The resulting fused data backbone supports a rich application layer featuring interactive querying, visualization tools, a RESTful API, and a suite of integrated AI tools (e.g., LLM-based assistants, generative models, and property predictors). Crucially, this architecture not only connects siloed experimental and theoretical data but also enriches them, creating a comprehensive, multi-modal profile for each material based on shared structural features.
  • Figure 2: Overview of integrated data sources and their heterogeneity. (a) Distribution of entries across the primary integrated databases, with experimental sources shown in blue and theoretical/computational sources in orange. The y-axis is on a logarithmic scale to accommodate the wide range of data volumes. The collection includes a large experimental crystal structure database (COD-derived), various computational property databases (e.g., topological materials, topological phonons), and a unique database of single-crystal growth experiments. (b) A conceptual Venn diagram illustrating the complex relationships of overlap and uniqueness among different data modalities. This highlights the core challenge of data heterogeneity, where, for instance, the materials space of experimental synthesis records, theoretically predicted topological materials, and the general crystal structure database are partially intersecting yet distinct, necessitating a robust data fusion strategy.
  • Figure 3: Data fusion and discovery workflow for CrGeTe3. The platform's dual-axis analysis is triggered by a query for a target material. Horizontal Integration: Direct data for CrGeTe3 are aggregated across multiple modules (e.g., "Crystal Structure", "Electronic Structure") to build a deep, cross-modal profile linking experiment and theory. Vertical Comparison: The material profile is enriched with data from structural analogs. For modules where direct data is missing (e.g., "Single Crystal Growth"), the platform provides actionable references from known similar materials (e.g., AlSiTe3). This comparison is further extended to novel, AI-generated structures (e.g., CuSiTe3), enabling the exploration of uncharted chemical space for accelerated materials discovery.
  • Figure 4: Integrated data visualization for CoSi, a topological phonon material. Horizontal integration aggregates multi-modal data for CoSi, including crystal structure and diffraction patterns, electronic band structure, topological classification, phononic dispersion, and consolidated single-crystal growth records from multiple experiments. Vertical comparison identifies structurally similar materials for both single-crystal growth and topological phonon properties, alongside AI-generated candidate structures (MoAs), demonstrating the platform's capability to connect experimental synthesis data, theoretical calculations, and computational predictions in a unified framework.
  • Figure 5: Integrated data visualization for LiNbO3, a benchmark nonlinear optical material. Horizontal integration summarizes experimental and theoretical data including crystal growth, electronic, and optical properties. Vertical comparison lists structurally similar compounds identified through vector-based similarity search, enabling comparative analysis within the Li–Nb–O materials family.
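The dual-axis analysis described in the Figure 3 caption can be sketched as follows, under simplifying assumptions. Each record is a plain dictionary tagged with a material and a module; horizontal integration collects the target's own records per module, and vertical comparison attaches records from structural analogs. The module names and the materials CrGeTe3 and AlSiTe3 come from the caption; the record layout, the `build_profile` function, and the placeholder payloads are hypothetical, and the analog list is assumed to come from a prior similarity search.

```python
MODULES = ["Crystal Structure", "Electronic Structure", "Single Crystal Growth"]


def build_profile(target, records, analogs):
    """Assemble a per-module profile for `target`.

    "direct" holds the target's own records (horizontal integration);
    "analogs" holds records from structurally similar materials
    (vertical comparison), which matter most when "direct" is empty.
    """
    profile = {}
    for module in MODULES:
        direct = [
            r for r in records
            if r["material"] == target and r["module"] == module
        ]
        borrowed = [
            r for r in records
            if r["material"] in analogs and r["module"] == module
        ]
        profile[module] = {"direct": direct, "analogs": borrowed}
    return profile


# Placeholder records; the "data" values are stand-ins, not real measurements.
records = [
    {"material": "CrGeTe3", "module": "Crystal Structure", "data": "placeholder"},
    {"material": "CrGeTe3", "module": "Electronic Structure", "data": "placeholder"},
    {"material": "AlSiTe3", "module": "Single Crystal Growth", "data": "placeholder"},
]

profile = build_profile("CrGeTe3", records, analogs={"AlSiTe3"})
for module, entry in profile.items():
    print(module, len(entry["direct"]), "direct,", len(entry["analogs"]), "from analogs")
```

Here the "Single Crystal Growth" module has no direct record for the target, so the only usable reference is the analog's record, mirroring the missing-data case the caption describes.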