
DataJoint 2.0: A Computational Substrate for Agentic Scientific Workflows

Dimitri Yatsenko, Thinh T. Nguyen

TL;DR

DataJoint 2.0 presents a relational workflow model that unifies data structure, data, and computation into a single tractable substrate for agentic scientific workflows. It introduces four technical innovations—object-augmented schemas, semantic matching, extensible type systems, and automated per-table job management—paired with a managed platform and open-source core to support scalable SciOps. The approach provides ACID transactional guarantees across relational and object stores, provenance-aware joins, and deterministic distributed computation, enabling AI agents to reason about data lineage and execution state safely. This framework offers practical pathways for robust, auditable, and collaborative human–agent science, with demonstrated adoption in neuroscience and a clear migration path from existing CWL/OL workflows to a unified schema-driven paradigm.

Abstract

Operational rigor determines whether human-agent collaboration succeeds or fails. Scientific data pipelines need the equivalent of DevOps -- SciOps -- yet common approaches fragment provenance across disconnected systems without transactional guarantees. DataJoint 2.0 addresses this gap through the relational workflow model: tables represent workflow steps, rows represent artifacts, foreign keys prescribe execution order. The schema specifies not only what data exists but how it is derived -- a single formal system where data structure, computational dependencies, and integrity constraints are all queryable, enforceable, and machine-readable. Four technical innovations extend this foundation: object-augmented schemas integrating relational metadata with scalable object storage, semantic matching using attribute lineage to prevent erroneous joins, an extensible type system for domain-specific formats, and distributed job coordination designed for composability with external orchestration. By unifying data structure, data, and computational transformations, DataJoint creates a substrate for SciOps where agents can participate in scientific workflows without risking data corruption.
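The claim that "foreign keys prescribe execution order" can be illustrated without DataJoint itself. The following is a minimal sketch, not DataJoint's API: it uses only Python's standard `sqlite3` and `graphlib` modules, and the table names (`session`, `scan`, `peak`) and the helper `execution_order` are hypothetical. It assumes a simple three-step pipeline, reads the foreign keys back out of the schema, and derives a valid step order from them. It also shows the database rejecting a row whose upstream dependency does not exist.

```python
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical three-step pipeline; table names are illustrative only.
DDL = """
CREATE TABLE session (session_id INTEGER PRIMARY KEY);
CREATE TABLE scan (
    session_id INTEGER REFERENCES session(session_id),
    scan_id INTEGER,
    PRIMARY KEY (session_id, scan_id)
);
CREATE TABLE peak (
    session_id INTEGER,
    scan_id INTEGER,
    peak_id INTEGER,
    PRIMARY KEY (session_id, scan_id, peak_id),
    FOREIGN KEY (session_id, scan_id) REFERENCES scan(session_id, scan_id)
);
"""


def execution_order(conn):
    """Derive a valid step order purely from the schema's foreign keys."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    graph = {}
    for table in tables:
        # Column 2 of each foreign_key_list row is the referenced (parent) table.
        fks = conn.execute(f"PRAGMA foreign_key_list({table})")
        graph[table] = {fk[2] for fk in fks}
    # TopologicalSorter takes {node: predecessors} and yields parents first.
    return list(TopologicalSorter(graph).static_order())


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(DDL)

order = execution_order(conn)
print(order)

# A downstream row without its upstream artifact is rejected by the database.
try:
    conn.execute("INSERT INTO scan VALUES (1, 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```

Because the order is computed from the schema rather than from a separate workflow file, a human or an agent inspecting the database sees the same dependency graph that the integrity constraints enforce.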

Paper Structure (34 sections, 2 figures, 1 table)


Figures (2)

  • Figure 1: DataJoint diagram of a liquid chromatography-mass spectrometry (LC-MS) data processing pipeline. Green rectangles are manual tables, gray boxes are lookup tables, and blue and red ellipses are imported and computed tables, respectively. Solid lines indicate identity inheritance; dashed lines indicate references. Part tables (e.g., Acquisition.Scan) appear as plain text. The workflow runs left to right: biological samples → instrument sessions → scan acquisition → spectral analysis → peak detection. See github.com/datajoint/lcms-demo.
  • Figure 2: DataJoint platform architecture. The open-source Python library provides the relational workflow model: schema definition, query algebra, and distributed computation. This core integrates with a relational database (system of record), object storage (scalable data), and code repositories (version-controlled pipeline definitions). The managed platform adds infrastructure, observability, and orchestration for production deployments.