Talk Freely, Execute Strictly: Schema-Gated Agentic AI for Flexible and Reproducible Scientific Workflows

Joel Strickland, Arjun Vijeta, Chris Moores, Oliwia Bodek, Bogdan Nenchev, Thomas Whitehead, Charles Phillips, Karl Tassenberg, Gareth Conduit, Ben Pellegrini

TL;DR

We argue that a schema-gated architecture, separating conversational from execution authority, is positioned to decouple the trade-off between conversational flexibility and execution determinism, and distill 3 operational principles--clarification-before-execution, constrained plan-act orchestration, and tool-to-workflow-level gating--to guide adoption.

Abstract

Large language models (LLMs) can now translate a researcher's plain-language goal into executable computation, yet scientific workflows demand determinism, provenance, and governance that are difficult to guarantee when an LLM decides what runs. Semi-structured interviews with 18 experts across 10 industrial R&D stakeholders surface 2 competing requirements--deterministic, constrained execution and conversational flexibility without workflow rigidity--together with boundary properties (human-in-the-loop control and transparency) that any resolution must satisfy. We propose schema-gated orchestration as the resolving principle: the schema becomes a mandatory execution boundary at the composed-workflow level, so that nothing runs unless the complete action--including cross-step dependencies--validates against a machine-checkable specification. We operationalize the 2 requirements as execution determinism (ED) and conversational flexibility (CF), and use these axes to review 20 systems spanning 5 architectural groups along a validation-scope spectrum. Scores are assigned via a multi-model protocol--15 independent sessions across 3 LLM families--yielding substantial-to-near-perfect inter-model agreement (Krippendorff's $\alpha=0.80$ for ED and $\alpha=0.98$ for CF), demonstrating that multi-model LLM scoring can serve as a reusable alternative to human expert panels for architectural assessment. The resulting landscape reveals an empirical Pareto front--no reviewed system achieves both high flexibility and high determinism--but a convergence zone emerges between the generative and workflow-centric extremes. We argue that a schema-gated architecture, separating conversational from execution authority, is positioned to decouple this trade-off, and distill 3 operational principles--clarification-before-execution, constrained plan-act orchestration, and tool-to-workflow-level gating--to guide adoption.
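
To make the gating idea concrete, the following is a minimal sketch of an execution gate, not the authors' implementation. It assumes a toy schema format (parameter name mapped to a Python type) and a convention that a value such as `$ds` refers to the output of an earlier step; the names `TOOL_SCHEMAS`, `validate_proposal`, and `gate` are hypothetical. The point it illustrates is that the whole multi-step proposal is checked, including cross-step references, and that any problem is routed to clarification instead of execution.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical tool schemas (assumption): parameter name -> required type.
TOOL_SCHEMAS = {
    "load_dataset": {"path": str},
    "fit_model": {"dataset": str, "learning_rate": float},
}


@dataclass
class GateResult:
    valid: bool
    problems: list[str] = field(default_factory=list)


def validate_proposal(proposal: dict[str, Any]) -> GateResult:
    """Validate a complete multi-step proposal; any problem blocks it."""
    problems: list[str] = []
    produced: set[str] = set()  # outputs named by earlier steps
    steps = proposal.get("steps", [])
    if not steps:
        problems.append("proposal contains no steps")
    for i, step in enumerate(steps):
        schema = TOOL_SCHEMAS.get(step.get("tool"))
        if schema is None:
            problems.append(f"step {i}: unknown tool {step.get('tool')!r}")
            continue
        params = step.get("params", {})
        for name in params:
            if name not in schema:
                problems.append(f"step {i}: unexpected parameter {name!r}")
        for name, expected in schema.items():
            if name not in params:
                problems.append(f"step {i}: missing parameter {name!r}")
                continue
            value = params[name]
            if isinstance(value, str) and value.startswith("$"):
                # Cross-step dependency: must refer to an earlier output.
                # Simplification: the referenced output's type is not checked.
                if value[1:] not in produced:
                    problems.append(f"step {i}: unresolved reference {value!r}")
            elif not isinstance(value, expected):
                problems.append(
                    f"step {i}: {name!r} should be {expected.__name__}"
                )
        if "output" in step:
            produced.add(step["output"])
    return GateResult(valid=not problems, problems=problems)


def gate(proposal: dict[str, Any], executor: Callable[[dict], Any]) -> dict:
    """Run the executor only if the entire proposal validates."""
    result = validate_proposal(proposal)
    if not result.valid:
        # Nothing runs; the problems feed a clarification turn instead.
        return {"status": "clarify", "questions": result.problems}
    return {"status": "executed", "outputs": executor(proposal)}
```

In this sketch the conversational side may propose anything, but the executor is only reachable through `gate`, which mirrors the separation of conversational and execution authority described above.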

Paper Structure (59 sections, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Thematic landscape from practitioner interviews. Each numbered circle represents one of the 17 themes; the key maps numbers to theme names. Position shows stakeholder breadth ($x$-axis: number of stakeholders mentioning the theme, out of 10) and total mention count ($y$-axis). Colour encodes theme group: blue = determinism pole, orange = flexibility pole, green = cross-cutting / boundary properties, grey = out of architectural scope. Dashed lines mark the median stakeholder breadth and median total mentions; the shaded quadrant highlights themes above both medians. Dotted ellipses enclose themes that share approximately the same grid position.
  • Figure 2: Architectural design space for conversational scientific workflows. Systems are positioned by mean ED and CF scores across 15 independent scoring runs (Appendix \ref{app:scoring-protocol} reports consensus medians; means are used here for visual separation of co-located systems); point labels correspond to system IDs. Colours encode five groups: generative (red), tool-augmented (orange), schema-gated (green), workflow + NL (purple), and workflow-centric (blue). The dashed line is the empirical Pareto front connecting non-dominated positions with piecewise-linear segments (i.e. straight lines between observed Pareto-optimal points); the shaded region below the front contains dominated positions. The gold star marks the ideal region of high flexibility and high determinism. Caveat: scores are ordinal (1--5) and non-quantitative; the piecewise-linear front is displayed for visual communication of the trade-off shape and should not be interpreted as implying interval-scale distances between positions or a continuous functional relationship.
  • Figure 3: Reference architecture separating conversational authority from execution authority. Context and state (user query, chat history, system information, internal data, and prior tool/workflow outputs) are assembled for an LLM orchestrator (planner/reasoner), which produces an assistant message and, optionally, an action proposal. Proposals enter the execution-authority gate (red dashed region): an output parser validates them against a JSON schema or workflow specification. Invalid proposals trigger a clarification loop; valid proposals are forwarded to the workflow executor. Resulting data, logs, and artifacts update the shared context for subsequent turns. The LLM may converse freely, but cannot execute---execution authority resides entirely in schema validation.
  • Figure 4: Domain tool schema validation and inclusion in the registry. New domain tools undergo schema and compliance checks---including type validation, documentation completeness, and service availability---before being added to the validated tool registry. Only schema-verified domain tools become available for workflow execution.
  • Figure 5: Workflow schema composition in the validated registry. Each workflow combines schema-validated domain tools into a DAG, defining explicit data and parameter flow between steps. Only workflows that pass acyclicity, type-compatibility, and parameter-resolution checks are included in the validated workflow registry.
  • ...and 1 more figures
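
The workflow-level checks named in the Figure 5 caption (acyclicity, type-compatibility, and parameter resolution) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the registry `TOOLS`, the type labels, the step format, and the function `check_workflow` are all hypothetical. Cycle detection uses the standard-library `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical validated tool registry (assumption):
# tool name -> (expected input type per parameter, output type).
TOOLS = {
    "load": ({}, "table"),
    "clean": ({"data": "table"}, "table"),
    "train": ({"data": "table"}, "model"),
}


def check_workflow(steps: dict) -> list[str]:
    """Return the problems found in a workflow; an empty list means it passes.

    steps maps a step id to {"tool": name, "inputs": {param: upstream step id}}.
    """
    problems = []
    for sid, step in steps.items():
        tool = step.get("tool")
        if tool not in TOOLS:
            problems.append(f"{sid}: tool {tool!r} is not in the registry")
            continue
        in_types, _ = TOOLS[tool]
        inputs = step.get("inputs", {})
        for param, expected in in_types.items():
            src = inputs.get(param)
            if src is None:
                # Parameter resolution: every declared input needs a source.
                problems.append(f"{sid}: parameter {param!r} is unresolved")
            elif src not in steps:
                problems.append(f"{sid}: {param!r} refers to unknown step {src!r}")
            else:
                # Type compatibility: upstream output must match the input type.
                src_tool = steps[src].get("tool")
                if src_tool in TOOLS and TOOLS[src_tool][1] != expected:
                    problems.append(
                        f"{sid}: {param!r} expects {expected!r}, "
                        f"got {TOOLS[src_tool][1]!r} from {src!r}"
                    )
    # Acyclicity: each step depends on the steps that feed its inputs.
    graph = {
        sid: {s for s in step.get("inputs", {}).values() if s in steps}
        for sid, step in steps.items()
    }
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        problems.append("workflow contains a cycle")
    return problems
```

Under these assumptions, a workflow would be admitted to the validated workflow registry only when `check_workflow` returns an empty list.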