The Stretto Execution Engine for LLM-Augmented Data Systems

Gabriele Sanmartino, Matthias Urban, Paolo Papotti, Carsten Binnig

TL;DR

Stretto addresses the fundamental runtime–accuracy trade-off in LLM-augmented data systems by providing end-to-end guarantees through a holistic, gradient-based optimizer that jointly selects operator implementations and budgets across a query plan. It expands the physical design space with KV-cache–enabled semantic operators, creating a dense spectrum of cost–quality trade-offs that the optimizer can exploit to meet global precision and recall targets. The architecture combines a global optimizer with offline KV cache creation and online batched execution, yielding substantial speedups over state-of-the-art baselines while maintaining probabilistic guarantees via Bayesian credible intervals. Across multimodal datasets and diverse queries, Stretto demonstrates robust target satisfaction, effective optimization of operator cascades, and significant runtime reductions, illustrating the practical viability of end-to-end quality guarantees in LLM-native data systems.

Abstract

LLM-augmented data systems enable semantic querying over structured and unstructured data, but executing queries with LLM-powered operators introduces a fundamental runtime–accuracy trade-off. In this paper, we present Stretto, a new execution engine that provides end-to-end query guarantees while navigating this trade-off efficiently and holistically. To this end, Stretto formulates query planning as a constrained optimization problem and uses a gradient-based optimizer to jointly select operator implementations and allocate error budgets across pipelines. Moreover, to enable fine-grained execution choices, Stretto introduces a novel use of KV-caching to realize a spectrum of physical operators, transforming a sparse design space into a dense continuum of runtime–accuracy trade-offs. Experiments show that Stretto outperforms state-of-the-art systems while consistently meeting quality guarantees.
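
As a rough illustration of the constrained formulation described above, the planning problem can be sketched as follows. The notation is an assumption introduced here, not taken from the paper: $\theta$ stands for the joint choice of physical operators and thresholds, $T(\theta)$ for the estimated runtime, $\tau_P$ and $\tau_R$ for the global precision and recall targets, $D$ for the profiling samples, and $1-\delta$ for the credible level.

$$
\min_{\theta} \; T(\theta)
\quad \text{s.t.} \quad
\Pr\big[\mathrm{Prec}(\theta) \ge \tau_P \mid D\big] \ge 1-\delta,
\qquad
\Pr\big[\mathrm{Rec}(\theta) \ge \tau_R \mid D\big] \ge 1-\delta
$$

Read this way, the constraints apply to the query as a whole rather than to each operator, which is what leaves the optimizer free to distribute the error budget across operators.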

Paper Structure (46 sections, 13 equations, 8 figures, 1 table, 1 algorithm)

This paper contains 46 sections, 13 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: Overview of Stretto. The input is a user query as a logical plan with semantic operators, which are globally optimized by selecting model sizes, KV caches, and thresholds. The optimizer allocates error budgets across operators to minimize runtime while meeting the global constraints. On the right, the navigable search space enabled by precomputed compressed KV caches exposes cost–quality trade-offs and allows the optimizer to explore different configurations for each model, modality and compression ratio. The final physical plan executes cascades of increasingly expensive operators so that cheap operators filter most tuples before invoking high-cost models. These cascades involve multiple physical implementations of the same logical operator, with tuples marked as unsure being passed to progressively more accurate and expensive operators.
  • Figure 2: The four steps of optimization in Stretto. The optimizer (1) pulls up semantic operators (yellow and purple) above relational ones (white) to reduce expensive LLM calls, (2) profiles semantic operators on samples to estimate cost–quality trade-offs, (3) applies gradient-based optimization over a continuous relaxation of the operator and parameter search space with Bayesian precision/recall guarantees, and (4) reorders the selected physical operators to minimize runtime.
  • Figure 3: How a single tuple is processed by a semantic filter in the continuous relaxation during optimization. Because of the continuous relaxation, physical operators can be selected partially, as expressed by the pick factor $\sigma \in [0,1]$. Initially, the input tuple is unsure, and each individual operator can accept it, reject it, or mark it as unsure. Unsure tuples are passed on to the next operator. When models can be selected partially, these decisions are soft, meaning that a tuple can be partially rejected, accepted, or marked as unsure.
  • Figure 4: Offline, the KV cache for each item is precomputed and stored under profiles for different models and compression ratios. At inference time, for a query (e.g., a request to extract diagnoses), a profile is selected, the corresponding KV caches are retrieved, and their items are batched to the LLM for execution. Because the KV caches are precomputed, the prefill phase is bypassed, significantly reducing latency and compute cost.
  • Figure 5: Top: Whether the global targets are met (Target Met > 1). We show the distribution over all queries as a boxplot, with the lower whisker set at the credible level (95%). Thus, an approach meets its statistical guarantees when the entire boxplot is above the Meets Target line. Stretto is the most reliable in this regard, meeting the target overall. Bottom: Runtime comparison of the two optimizers that have statistical guarantees. Stretto outperforms Lotus on all datasets.
  • ...and 3 more figures
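
The cascade behavior described in the captions for Figures 1 and 3 (each operator accepts, rejects, or marks a tuple as unsure, and unsure tuples move on to a more expensive operator) can be illustrated with a minimal sketch. Everything below is hypothetical and only meant to make the control flow concrete: the `CascadeStage` type, the toy scoring functions, the costs, and the thresholds are assumptions, and rejecting tuples that are still unsure after the last stage is a choice made here, not something the source specifies. The sketch shows hard decisions only; the soft, partially selected operators of the continuous relaxation are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CascadeStage:
    """One physical implementation of a semantic filter (hypothetical type)."""
    name: str
    cost: float                      # assumed per-tuple cost of running this stage
    score: Callable[[str], float]    # returns a confidence in [0, 1]
    reject_below: float              # score <= reject_below -> reject
    accept_above: float              # score >= accept_above -> accept


def run_cascade(item: str, stages: List[CascadeStage]) -> Tuple[bool, float]:
    """Pass one tuple through the stages, cheapest first.

    Returns (accepted, total_cost). A tuple starts out unsure and only
    reaches a later stage if every earlier stage left it unsure.
    """
    total_cost = 0.0
    for stage in stages:
        total_cost += stage.cost
        s = stage.score(item)
        if s >= stage.accept_above:
            return True, total_cost
        if s <= stage.reject_below:
            return False, total_cost
        # Otherwise the tuple stays unsure and falls through to the next stage.
    # Assumption: tuples still unsure after the last stage are rejected.
    return False, total_cost


# Toy stand-ins for a small and a large model (not real LLM calls).
def cheap_score(text: str) -> float:
    if "diagnosis" in text:
        return 0.9
    return 0.1 if len(text) < 10 else 0.5


def expensive_score(text: str) -> float:
    return 0.8 if "fever" in text else 0.2


stages = [
    CascadeStage("small-model", 1.0, cheap_score, reject_below=0.2, accept_above=0.8),
    # Equal thresholds force the final stage to always decide.
    CascadeStage("large-model", 10.0, expensive_score, reject_below=0.5, accept_above=0.5),
]

for doc in ["diagnosis: flu", "short", "patient reports fever and cough"]:
    print(doc, "->", run_cascade(doc, stages))
```

In this toy setup the first two inputs are resolved by the cheap stage alone, and only the third pays for the expensive stage. Widening or narrowing the gap between `reject_below` and `accept_above` changes how many tuples fall through, which is the kind of per-operator knob a global optimizer could tune against a query-level error budget.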