Talk Freely, Execute Strictly: Schema-Gated Agentic AI for Flexible and Reproducible Scientific Workflows

Joel Strickland; Arjun Vijeta; Chris Moores; Oliwia Bodek; Bogdan Nenchev; Thomas Whitehead; Charles Phillips; Karl Tassenberg; Gareth Conduit; Ben Pellegrini

TL;DR

We argue that a schema-gated architecture, which separates conversational authority from execution authority, is positioned to decouple the trade-off between conversational flexibility and execution determinism in LLM-driven scientific workflows. We distill 3 operational principles to guide adoption: clarification-before-execution, constrained plan-act orchestration, and tool-to-workflow-level gating.

Abstract

Large language models (LLMs) can now translate a researcher's plain-language goal into executable computation, yet scientific workflows demand determinism, provenance, and governance that are difficult to guarantee when an LLM decides what runs. Semi-structured interviews with 18 experts across 10 industrial R&D stakeholders surface 2 competing requirements--deterministic, constrained execution and conversational flexibility without workflow rigidity--together with boundary properties (human-in-the-loop control and transparency) that any resolution must satisfy. We propose schema-gated orchestration as the resolving principle: the schema becomes a mandatory execution boundary at the composed-workflow level, so that nothing runs unless the complete action--including cross-step dependencies--validates against a machine-checkable specification. We operationalize the 2 requirements as execution determinism (ED) and conversational flexibility (CF), and use these axes to review 20 systems spanning 5 architectural groups along a validation-scope spectrum. Scores are assigned via a multi-model protocol--15 independent sessions across 3 LLM families--yielding substantial-to-near-perfect inter-model agreement (Krippendorff's $\alpha=0.80$ for ED and $\alpha=0.98$ for CF), demonstrating that multi-model LLM scoring can serve as a reusable alternative to human expert panels for architectural assessment. The resulting landscape reveals an empirical Pareto front--no reviewed system achieves both high flexibility and high determinism--but a convergence zone emerges between the generative and workflow-centric extremes. We argue that a schema-gated architecture, separating conversational from execution authority, is positioned to decouple this trade-off, and distill 3 operational principles--clarification-before-execution, constrained plan-act orchestration, and tool-to-workflow-level gating--to guide adoption.
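To make the gating idea in the abstract concrete, the following is a minimal sketch of an execution gate, not the authors' implementation. It assumes a hypothetical tool registry (`SCHEMAS`), a hypothetical tool name (`fit_model`), and a simple proposal format (`{"tool": ..., "args": ...}`); none of these names come from the paper. The point it illustrates is that the LLM's proposal is only data, and nothing is approved for execution unless it validates completely.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical registry (an assumption for illustration):
# tool name -> {parameter name: (expected type, required?)}
SCHEMAS: dict[str, dict[str, tuple[type, bool]]] = {
    "fit_model": {"dataset_id": (str, True), "n_folds": (int, False)},
}


@dataclass
class GateResult:
    approved: bool
    problems: list[str] = field(default_factory=list)


def gate(proposal: dict[str, Any]) -> GateResult:
    """Approve a proposal only if it validates fully against its schema."""
    tool = proposal.get("tool")
    if not isinstance(tool, str) or tool not in SCHEMAS:
        return GateResult(False, [f"unknown tool: {tool!r}"])

    args = proposal.get("args", {})
    if not isinstance(args, dict):
        return GateResult(False, ["args must be an object"])

    schema = SCHEMAS[tool]
    problems: list[str] = []
    for name, (expected, required) in schema.items():
        if name not in args:
            if required:
                problems.append(f"missing required parameter: {name}")
        elif type(args[name]) is not expected:
            # Exact type match, so that e.g. True is not accepted as an int.
            problems.append(f"wrong type for parameter: {name}")
    for name in args:
        if name not in schema:
            problems.append(f"unexpected parameter: {name}")

    # An empty problem list is the only path to approval.
    return GateResult(not problems, problems)
```

In a clarification-before-execution loop, a rejected proposal's `problems` list could be handed back to the conversational side as the basis for a follow-up question, while an approved proposal would be forwarded to the executor.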

Paper Structure (59 sections, 6 figures, 12 tables)

Introduction
Relationship to prior work.
Contributions.
User Research and Requirements Elicitation
Thematic landscape
Two architectural requirements
Req A. Execution determinism.
Req B. Conversational flexibility.
Boundary properties.
Operationalisation.
Execution Authority and Architectural Paradigms for Conversational Scientific Workflows
Review scope and classification scheme
Ordinal scoring for the design-space plot (ED/CF).
Scoring protocol.
Architectural design space
...and 44 more sections

Figures (6)

Figure 1: Thematic landscape from practitioner interviews. Each numbered circle represents one of the 17 themes; the key maps numbers to theme names. Position shows stakeholder breadth ($x$-axis: number of stakeholders mentioning the theme, out of 10) and total mention count ($y$-axis). Colour encodes theme group: blue = determinism pole, orange = flexibility pole, green = cross-cutting / boundary properties, grey = out of architectural scope. Dashed lines mark the median stakeholder breadth and median total mentions; the shaded quadrant highlights themes above both medians. Dotted ellipses enclose themes that share approximately the same grid position.
Figure 2: Architectural design space for conversational scientific workflows. Systems are positioned by mean ED and CF scores across 15 independent scoring runs (Appendix \ref{app:scoring-protocol} reports consensus medians; means are used here for visual separation of co-located systems); point labels correspond to system IDs. Colours encode five groups: generative (red), tool-augmented (orange), schema-gated (green), workflow + NL (purple), and workflow-centric (blue). The dashed line is the empirical Pareto front connecting non-dominated positions with piecewise-linear segments (i.e. straight lines between observed Pareto-optimal points); the shaded region below the front contains dominated positions. The gold star marks the ideal region of high flexibility and high determinism. Caveat: scores are ordinal (1--5) and non-quantitative; the piecewise-linear front is displayed for visual communication of the trade-off shape and should not be interpreted as implying interval-scale distances between positions or a continuous functional relationship.
Figure 3: Reference architecture separating conversational authority from execution authority. Context and state (user query, chat history, system information, internal data, and prior tool/workflow outputs) are assembled for an LLM orchestrator (planner/reasoner), which produces an assistant message and, optionally, an action proposal. Proposals enter the execution-authority gate (red dashed region): an output parser validates them against a JSON schema or workflow specification. Invalid proposals trigger a clarification loop; valid proposals are forwarded to the workflow executor. Resulting data, logs, and artifacts update the shared context for subsequent turns. The LLM may converse freely, but cannot execute---execution authority resides entirely in schema validation.
Figure 4: Domain tool schema validation and inclusion in the registry. New domain tools undergo schema and compliance checks---including type validation, documentation completeness, and service availability---before being added to the validated tool registry. Only schema-verified domain tools become available for workflow execution.
Figure 5: Workflow schema composition in the validated registry. Each workflow combines schema-validated domain tools into a DAG, defining explicit data and parameter flow between steps. Only workflows that pass acyclicity, type-compatibility, and parameter-resolution checks are included in the validated workflow registry.
...and 1 more figures
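The three workflow-level checks named in the Figure 5 caption (acyclicity, type compatibility, and parameter resolution) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's registry: the `TOOLS` table, the tool names, the string type labels, and the `check_workflow` signature are all hypothetical, and every step id used in an edge is assumed to appear in `steps`.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tool registry (an assumption for illustration):
# tool name -> ({input name: type label}, output type label)
TOOLS = {
    "load": ({"path": "str"}, "table"),
    "clean": ({"data": "table"}, "table"),
    "fit": ({"data": "table", "alpha": "float"}, "model"),
}


def check_workflow(steps, edges, params):
    """Return a list of problems; an empty list means the workflow passes.

    steps:  {step id: tool name}
    edges:  [(source step, destination step, destination input name)]
    params: {(step id, input name): type label} for user-supplied values
    """
    problems = []

    # 1. Acyclicity: the step graph must admit a topological order.
    predecessors = {step: set() for step in steps}
    for src, dst, _ in edges:
        predecessors[dst].add(src)
    try:
        tuple(TopologicalSorter(predecessors).static_order())
    except CycleError:
        problems.append("workflow graph contains a cycle")

    # Record what each input is bound to: a user parameter or an upstream output.
    bound = dict(params)
    for src, dst, inp in edges:
        bound[(dst, inp)] = TOOLS[steps[src]][1]

    # 2. Parameter resolution and 3. type compatibility for every declared input.
    for step, tool in steps.items():
        for inp, expected in TOOLS[tool][0].items():
            actual = bound.get((step, inp))
            if actual is None:
                problems.append(f"{step}.{inp} is unresolved")
            elif actual != expected:
                problems.append(f"{step}.{inp} expects {expected}, got {actual}")
    return problems
```

Under this sketch, only a workflow for which `check_workflow` returns an empty list would be admitted to a validated registry; anything else is rejected before any step runs.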
