
Streaming Model Cascades for Semantic SQL

Paweł Liskowski, Kyle Schmaus

Abstract

Modern data warehouses extend SQL with semantic operators that invoke large language models on each qualifying row, but the per-row inference cost is prohibitive at scale. Model cascades reduce this cost by routing most rows through a fast proxy model and delegating uncertain cases to an expensive oracle. Existing frameworks, however, require global dataset access and optimize a single quality metric, limiting their applicability in distributed systems where data is partitioned across independent workers. We present two adaptive cascade algorithms designed for streaming, per-partition execution in which each worker processes its partition independently without inter-worker communication. SUPG-IT extends the SUPG statistical framework to streaming execution with iterative threshold refinement and joint precision-recall guarantees. GAMCAL replaces user-specified quality targets with a learned calibration model: a Generalized Additive Model maps proxy scores to calibrated probabilities with uncertainty quantification, enabling direct optimization of a cost-quality tradeoff through a single parameter. Experiments on six datasets in a production semantic SQL engine show that both algorithms achieve F1 > 0.95 on every dataset. GAMCAL achieves higher F1 per oracle call at cost-sensitive operating points, while SUPG-IT reaches a higher quality ceiling with formal guarantees on precision and recall.
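The two-threshold routing rule at the heart of both algorithms can be stated compactly: rows whose proxy score clears the upper threshold are accepted outright, rows below the lower threshold are rejected, and only the uncertain middle band is delegated to the oracle. The sketch below is a minimal illustration of that rule; the function name `route` and the threshold values are hypothetical, not from the paper.

```python
def route(score: float, tau_low: float, tau_high: float) -> str:
    """Two-threshold cascade routing on a proxy score in [0, 1].

    Rows with score >= tau_high are accepted by the proxy alone;
    rows with score <= tau_low are rejected by the proxy alone;
    everything in between is delegated to the expensive oracle.
    """
    if score >= tau_high:
        return "accept"    # proxy is confident the row qualifies
    if score <= tau_low:
        return "reject"    # proxy is confident the row does not qualify
    return "delegate"      # uncertain band: pay for an oracle call
```

As the figures later show, both SUPG-IT and GAMCAL adapt `tau_low` and `tau_high` online, narrowing the uncertain band (and hence the delegation rate) as oracle observations accumulate.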

Figures (8)

  • Figure 1: Streaming execution model. Each worker processes its data partition independently, maintaining local threshold estimates and updating them based on its own oracle observations. Workers do not share samples or synchronize.
  • Figure 2: Threshold convergence on synthetic data with overlapping bimodal class distributions ($m = 5{,}000$ records, $k_t = 20$ samples per iteration, $t_R = t_P = 0.8$). The recall threshold $\tau_{\text{low}}$ (orange) rises as oracle samples accumulate. The precision threshold $\tau_{\text{high}}$ (red) descends as the confidence bound on precision tightens. The shaded uncertain region narrows accordingly, reducing oracle delegation.
  • Figure 3: GAM-based calibration on synthetic data with nonlinear miscalibration ($n = 3{,}000$, $\lambda = 0.6$). Left: The GAM calibration curve $g(s)$ (blue) captures the S-shaped departure from the diagonal that Platt scaling (gray dashed) cannot represent. The shaded band shows the 95% confidence interval from the GAM posterior. Right: Reliability diagram comparing raw, Platt-calibrated, and GAM-calibrated scores. GAM calibration reduces expected calibration error from $0.140$ (raw) to $0.005$, compared to $0.047$ for Platt scaling.
  • Figure 4: GAMCAL threshold convergence on synthetic bimodal data ($m = 10{,}000$, batch size $200$, $\rho = 0.03$, $\alpha = 0.35$). Top: Thresholds $\tau_{\text{low}}$ (orange) and $\tau_{\text{high}}$ (red) narrow in discrete steps at retraining events (dotted lines) on the doubling schedule. The shaded region marks the uncertain interval. Bottom: Delegation rate drops from $1.0$ during the cold-start phase to approximately $0.15$ as calibration improves. Each retraining event further reduces delegation.
  • Figure 5: $F_1$ vs. delegation rate for GAMCAL (sweeping $\alpha$) and SUPG-IT (sweeping shared target $t_P = t_R$) across six datasets. Each point is the mean over 10 seeds (error bars: one standard deviation). Dashed horizontal lines mark the proxy-only and oracle baselines. GAMCAL's frontier lies above or overlaps SUPG-IT's on every dataset.
  • ...and 3 more figures
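Figure 3 compares calibration methods by expected calibration error (ECE), the bin-weighted gap between predicted confidence and observed accuracy. A minimal sketch of the standard binned ECE estimator follows; the function name and the 10-bin choice are illustrative assumptions, not details from the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins: int = 10) -> float:
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi], with the first bin closed at 0.
        mask = (probs >= lo if lo == 0.0 else probs > lo) & (probs <= hi)
        if not mask.any():
            continue
        confidence = probs[mask].mean()   # mean predicted probability in bin
        accuracy = labels[mask].mean()    # empirical positive rate in bin
        ece += mask.mean() * abs(accuracy - confidence)
    return ece
```

Under this estimator, a perfectly calibrated score assignment yields an ECE near zero, matching the reading of the reliability diagram in Figure 3, where GAM calibration drives ECE from 0.140 down to 0.005.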