ZnTrack -- Data as Code

Fabian Zills, Moritz Schäfer, Samuel Tovey, Johannes Kästner, Christian Holm

TL;DR

ZnTrack introduces Data as Code by embedding data generation, versioning, and analysis within a Git-backed, Python-driven workflow framework. It builds a theory and architecture around computational graphs, Node definitions, and graph-serialized configurations to enable reproducible, shareable data pipelines with data stored alongside code. Demonstrations in atomistic simulations and a Random Forest example illustrate parallel execution, experiment management, and data availability within open-source tooling. The approach aims to lower overhead, enhance collaboration, and promote FAIR data practices in data-driven research.
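To make the idea of Nodes, a computational graph, and a graph-serialized configuration more concrete, here is a minimal standard-library sketch. It is not ZnTrack's API: the `Node` and `Graph` classes, their methods, and the example node names are all hypothetical and exist only for illustration. The sketch shows how a pipeline can be reduced to a small text description (parameters and dependencies, no data) that could be committed to Git, while the data itself is regenerated by running the graph.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical workflow node: a name, a function, parameters, and dependencies."""
    name: str
    func: Callable[..., object]
    params: Dict[str, object] = field(default_factory=dict)
    deps: List[str] = field(default_factory=list)


class Graph:
    """Toy computational graph (assumed acyclic; no cycle detection)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add(self, node: Node) -> Node:
        self.nodes[node.name] = node
        return node

    def serialize(self) -> str:
        # Text-only description of the graph: parameters and edges, no data.
        spec = {
            name: {"params": node.params, "deps": node.deps}
            for name, node in self.nodes.items()
        }
        return json.dumps(spec, sort_keys=True, indent=2)

    def fingerprint(self) -> str:
        # Changes whenever a parameter or dependency changes.
        return hashlib.sha256(self.serialize().encode()).hexdigest()

    def run(self) -> Dict[str, object]:
        results: Dict[str, object] = {}

        def visit(name: str) -> None:
            if name in results:
                return
            node = self.nodes[name]
            for dep in node.deps:
                visit(dep)  # run dependencies first
            inputs = [results[dep] for dep in node.deps]
            results[name] = node.func(*inputs, **node.params)

        for name in self.nodes:
            visit(name)
        return results


# Two hypothetical nodes: one generates data, one analyses it.
graph = Graph()
graph.add(Node("generate", lambda n: list(range(n)), params={"n": 5}))
graph.add(Node("total", lambda data: sum(data), deps=["generate"]))

results = graph.run()
print(graph.serialize())
print(results["total"])
```

In this sketch, only the output of `serialize()` would need to be versioned; a change to any parameter alters the fingerprint, which is one simple way to tell experiments apart.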

Abstract

The past decade has seen tremendous breakthroughs in computation and there is no indication that this will slow any time soon. Machine learning, large-scale computing resources, and increased industry focus have resulted in rising investments in computer-driven solutions for data management, simulations, and model generation. However, with this growth in computation has come an even larger expansion of data and with it, complexity in data storage, sharing, and tracking. In this work, we introduce ZnTrack, a Python-driven data versioning tool. ZnTrack builds upon established version control systems to provide a user-friendly and easy-to-use interface for tracking parameters in experiments, designing workflows, and storing and sharing data. From this ability to reduce large datasets to a simple Python script emerges the concept of Data as Code, a core component of the work presented here and an undoubtedly important concept as the age of computation continues to evolve. ZnTrack offers an open-source, FAIR data compatible Python package to enable users to harness these concepts of the future.

Paper Structure (25 sections, 1 equation, 10 figures)


Figures (10)

  • Figure 1: Combination of GIT and DVC in a local repository connected to a GIT remote and data remote.
  • Figure 2: Experiment versioning using GIT. Each Experiment represents a detached commit. The best experiment is committed and new experiments are performed based on this commit.
  • Figure 3: Illustration of a .
  • Figure 4: Illustration of the paradigm.
  • Figure 5: The inputs and outputs of a node are split into and attributes. The attributes describe file paths whilst the attributes can contain arbitrary data and are managed by ZnTrack.
  • ...and 5 more figures