100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models

Yeounoh Chung; Rushabh Desai; Jian He; Yu Xiao; Thibaud Hottelier; Yves-Laurent Kom Samo; Pushkar Kadilkar; Xianshun Chen; Sam Idicula; Fatma Özcan; Alon Halevy; Yannis Papakonstantinou

Abstract

Several data warehouse and database providers have recently introduced SQL extensions called AI Queries, which let users specify functions and conditions in SQL that are evaluated by LLMs. This significantly broadens the kinds of queries one can express over a combination of structured and unstructured data. LLMs offer remarkable semantic reasoning capabilities, making them an essential tool for complex and nuanced queries of this kind. While extremely powerful, AI queries can become prohibitively costly when the LLM is invoked thousands of times. This paper provides an extensive evaluation of a recent AI query approximation approach that enables low-cost analytics and database applications to benefit from AI queries. The approach delivers a >100x cost and latency reduction for the semantic filter ($AI.IF$) operator and important gains for semantic ranking ($AI.RANK$). The cost and performance gains come from using cheap and accurate proxy models over embedding vectors. We show that, despite the massive gains in latency and cost, these proxy models preserve accuracy, and occasionally improve it, across various benchmark datasets, including the extended Amazon reviews benchmark with 10M rows. We present an OLAP-friendly architecture within Google BigQuery that applies this approach to purely online (ad hoc) queries, and a low-latency, HTAP-friendly architecture in AlloyDB that could further reduce latency by moving proxy model training offline. We also present techniques that accelerate proxy model training.
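The core idea in the abstract, a cheap proxy model over embedding vectors standing in for per-row LLM calls, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than the paper's implementation: the proxy is a plain logistic regression, `StubLLM` is a hypothetical stand-in for the expensive LLM judgment, and the sample size of 200 is arbitrary. The sketch labels a small sample with the "LLM", trains the proxy on those labels, and applies the proxy to every row.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_proxy(embeddings, labels, epochs=200, lr=0.5):
    """Fit a logistic regression (weights, bias) by batch gradient descent."""
    dim = len(embeddings[0])
    n = len(embeddings)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def proxy_predict(model, x):
    """Return True if the proxy predicts the row passes the semantic filter."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5


class StubLLM:
    """Hypothetical stand-in for an expensive LLM call; counts invocations."""

    def __init__(self):
        self.calls = 0

    def judge(self, embedding):
        self.calls += 1
        # Assumed toy "semantic" condition, used only to generate labels.
        return 1 if embedding[0] + embedding[1] > 0 else 0


def approximate_filter(embeddings, llm, sample_size=200, seed=0):
    """Label a random sample with the LLM, train a proxy, score all rows."""
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(embeddings)), sample_size)
    sample = [embeddings[i] for i in sample_idx]
    labels = [llm.judge(x) for x in sample]
    model = train_proxy(sample, labels)
    return [proxy_predict(model, x) for x in embeddings]
```

In this toy setup the number of LLM calls is fixed by the sample size rather than the table size, which is where the savings would come from; the reduction seen here is a property of the chosen sample size and says nothing about the figures reported in the paper.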

Paper Structure (21 sections, 7 figures, 15 tables)

Introduction
Related Works
A Case for Lightweight Proxy Model For Efficient AI Query Engine
Lightweight Proxy Approximation
Online and Offline Proxy Model Training
Imbalanced Label Training
Automatic Proxy Model Evaluation
Adaptive Proxy Model Selection
Detailed Analysis of Proxy Approximation
Evaluation Setup
Fast Semantic Filtering
Fast Semantic Ranking
Impact of Sampling Strategy
Imbalanced Data Label Challenge
Role of Embedding Quality
...and 6 more sections

Figures (7)

Figure 1: AI query execution plan construction with the proxy model approximation process. We parse the AI query and extract the semantic operators ($O_i$) along with the semantic queries/prompts ($Q_i$) and the unstructured data columns ($C_i$). We apply proxy model approximation to each operator. Proxy models can be trained online with the best known or default configurations, or prepared offline for known query patterns, depending on the analytics database architecture (OLAP or HTAP). Adaptive model selection between the proxy and the LLM, based on automatic online model evaluation results, allows us to choose a cost-efficient and accurate execution strategy.
Figure 2: Relative wall-clock time of each step of the proxy model optimization with pre-computed embeddings. Only the sampling and prediction steps scale with the input relation size. The fraction of time spent on prediction does not increase between 1M and 10M rows because BigQuery is able to use more parallelism as the amount of data increases.
Figure 3: Proxy model performance (nDCG@10 on online training sample) for rubric-based ranking on TREC-DL-2022. The online Proxy performs better after labeling >120 samples; Adaptive Proxy relies on LLM until the Proxy becomes as good as the baseline.
Figure 4: Impact of sampling strategies on training data imbalance ratios, measured across datasets with varying characteristics: imbalance ratio ($\rho$) for classification benchmark datasets (a)(b); relevant documents per query ($\gamma$) / the number of corpus documents for IR benchmark datasets (c)(d).
Figure 5: Impact of different imbalanced label training techniques, as described in Section \ref{sec:imbalanced_training_techniques}.
...and 2 more figures
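The captions for Figures 1 and 3 describe choosing between the proxy and the LLM based on an online evaluation metric such as nDCG@10. A minimal sketch of that decision follows, under stated assumptions: the linear-gain form of DCG is used (other gain functions exist, and the captions do not say which one the paper uses), and `choose_executor` with its `tolerance` parameter is a hypothetical helper, not the paper's selection rule.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(scores, relevances, k=10):
    """nDCG@k of the ranking induced by `scores` against graded relevances."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def choose_executor(proxy_ndcg, baseline_ndcg, tolerance=0.0):
    """Hypothetical rule: use the proxy once it matches the baseline score."""
    return "proxy" if proxy_ndcg >= baseline_ndcg - tolerance else "llm"
```

With a rule of this shape, the engine would keep calling the LLM while the proxy scores below the baseline on the labeled sample and switch to the proxy once it catches up, which matches the behavior the Figure 3 caption attributes to the Adaptive Proxy.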

Theorems & Definitions (2)

definition 1
definition 2
