Sherlock: Reliable and Efficient Agentic Workflow Execution

Yeonju Ro, Haoran Qiu, Íñigo Goiri, Rodrigo Fonseca, Ricardo Bianchini, Aditya Akella, Zhangyang Wang, Mattan Erez, Esha Choukse

TL;DR

Sherlock addresses reliability and latency in agentic workflows by combining three components: counterfactual vulnerability analysis that guides topology-aware verifier placement, a cost-aware verifier selector learned via preference optimization, and speculative execution that overlaps verification with downstream computation. It reports an 18.3% average accuracy gain over a non-verifying baseline, up to 48.7% lower end-to-end latency than non-speculative execution, and 26.0% lower verification cost than a Monte Carlo search-based method across diverse benchmarks, while remaining adaptable to new domains through lightweight onboarding. Because it avoids brute-force per-run verification, the approach is practical for dynamically generated workflows.

Abstract

With the increasing adoption of large language models (LLMs), agentic workflows, which compose multiple LLM calls with tools, retrieval, and reasoning steps, are increasingly replacing traditional applications. However, such workflows are inherently error-prone: incorrect or partially correct output at one step can propagate or even amplify through subsequent stages, compounding the impact on the final output. Recent work proposes integrating verifiers that validate LLM outputs or actions, such as self-reflection, debate, or LLM-as-a-judge mechanisms. Yet verifying every step introduces significant latency and cost overheads. In this work, we seek to answer three key questions: which nodes in a workflow are most error-prone and thus deserve costly verification, how to select the most appropriate verifier for each node, and how to use verification with minimal impact on latency. Our solution, Sherlock, addresses these questions by using counterfactual analysis on agentic workflows to identify error-prone nodes and by selectively attaching cost-optimal verifiers only where necessary. At runtime, Sherlock speculatively executes downstream tasks while verification runs in the background, which reduces latency overhead. If verification fails, execution is rolled back to the last verified output. Compared to the non-verifying baseline, Sherlock delivers an 18.3% accuracy gain on average across benchmarks. Sherlock reduces workflow execution time by up to 48.7% over non-speculative execution and lowers verification cost by 26.0% compared to the Monte Carlo search-based method, demonstrating that principled, fault-aware verification effectively balances efficiency and reliability in agentic workflows.
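The runtime behavior described in the abstract (speculative downstream execution, background verification, rollback on failure) can be illustrated with a minimal sketch. This is not Sherlock's implementation: it assumes a linear chain of nodes rather than a general workflow graph, and it assumes each verifier returns a pass/fail flag together with a corrected output. The names `run_with_speculation`, `nodes`, and `verifiers` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_speculation(nodes, verifiers, task_input):
    """Run a linear chain of nodes, overlapping verification with the next node.

    nodes:     list of callables; each maps the previous output to a new output.
    verifiers: dict {node index: callable(output) -> (ok, corrected_output)}.
               Nodes without an entry are not verified (assumed interface).
    Returns (final_output, number_of_rollbacks).
    """
    rollbacks = 0
    with ThreadPoolExecutor(max_workers=2) as pool:
        current = nodes[0](task_input)
        for i in range(len(nodes)):
            verifier = verifiers.get(i)
            has_next = i + 1 < len(nodes)
            # Speculate: start the downstream node on the unverified output.
            speculative = pool.submit(nodes[i + 1], current) if has_next else None
            if verifier is not None:
                # Verification runs concurrently with the speculative node.
                ok, corrected = pool.submit(verifier, current).result()
                if not ok:
                    # Roll back: drop the speculative result and continue
                    # from the verified (corrected) output instead.
                    rollbacks += 1
                    current = corrected
                    if has_next:
                        speculative.cancel()  # best effort; may already be running
                        speculative = pool.submit(nodes[i + 1], current)
            if has_next:
                current = speculative.result()
    return current, rollbacks


# Toy workflow: node 0 can produce an invalid (negative) value.
nodes = [lambda x: x - 10, lambda x: x * 2, lambda x: x + 1]
# Only node 0 is verified; the verifier clamps invalid outputs to 0.
verifiers = {0: lambda out: (out >= 0, max(out, 0))}

print(run_with_speculation(nodes, verifiers, 15))  # verification passes
print(run_with_speculation(nodes, verifiers, 3))   # verification fails once
```

When verification passes, the speculative result is already available and the verifier's latency is hidden behind the downstream node. When it fails, the cost is one wasted downstream execution plus a re-run from the verified output.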

Paper Structure

This paper contains 43 sections, 12 equations, 17 figures, 5 tables, 1 algorithm.

Figures (17)

  • Figure 1: Example agentic workflow where a user submits a task in natural language and an LLM-based planner generates a workflow composed of multiple subtasks (W1–W4). Each node may involve various tools (e.g., web search, file retrieval). In this work, we assume that per-node verifiers (V1–V4) are added.
  • Figure 2: State-of-the-art LLM verifiers. Grey indicates extra LLM calls from verification, and each dollar emoji indicates an advanced model (more expensive). Judge indicates a judge LLM.
  • Figure 3: Verifier Characterization. Comparison of different verifiers' performance across task categories. Latency and cost are normalized to baseline execution latency and cost.
  • Figure 4: Verifier utility by task. Utility is computed as $\text{accuracy\_gain} - \lambda \cdot \text{cost}$, with higher values indicating better cost effectiveness. Verifier utility is explained in detail in the verifier selector section (sec:verifier_selector).
  • Figure 5: Verified Output Redundancy. Match rate denotes the proportion of verified outputs matching the originals.
  • ...and 12 more figures
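
As a worked illustration of the utility formula in the Figure 4 caption, the sketch below computes accuracy_gain − λ · cost for a few verifier types and picks the highest-scoring one. The accuracy gains and costs are made-up numbers, not results from the paper, and the plain argmax stands in for Sherlock's learned selector, which the summary describes as trained via preference optimization. The names `utility`, `best_verifier`, and `candidates` are hypothetical.

```python
def utility(accuracy_gain, cost, lam):
    """Utility from the Figure 4 caption: accuracy_gain - lambda * cost."""
    return accuracy_gain - lam * cost


def best_verifier(candidates, lam):
    """Return the candidate name with the highest utility for a given lambda."""
    return max(candidates, key=lambda name: utility(*candidates[name], lam))


# Hypothetical (accuracy_gain, normalized_cost) pairs; NOT values from the paper.
candidates = {
    "no_verifier": (0.00, 0.0),
    "self_reflection": (0.05, 1.0),
    "judge": (0.10, 1.5),
    "debate": (0.12, 3.0),
}

for lam in (0.01, 0.05, 0.1):
    print(lam, best_verifier(candidates, lam))
```

With these illustrative numbers, a small λ favors the most accurate but most expensive verifier, a moderate λ favors the cheaper judge, and a large λ makes skipping verification the best choice, which mirrors the idea of attaching verifiers only where they pay off.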