ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

Hengrui Zhang, Yulong Hui, Yihao Liu, Huanchen Zhang

TL;DR

A system that makes large-scale semantic analysis practical and efficient by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. Its two core innovations are a contrastive-learning-based framework for training the proxy model and an adaptive cascade mechanism that meets specified accuracy targets.

Abstract

Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, but with enormous document collections and ad-hoc queries, their high inference cost leads to unacceptable overhead. We therefore introduce \textsc{ScaleDoc}, a novel system that addresses this problem by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, \textsc{ScaleDoc} leverages an LLM to generate a semantic representation for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter out the majority of documents, forwarding only the ambiguous cases to the LLM for a final decision. Furthermore, \textsc{ScaleDoc} proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable decision scores for the predicate; (2) an adaptive cascade mechanism that determines an effective filtering policy while meeting specific accuracy targets. Our evaluations across three datasets demonstrate that \textsc{ScaleDoc} achieves over a 2$\times$ end-to-end speedup and reduces expensive LLM invocations by up to 85\%, making large-scale semantic analysis practical and efficient.
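The online cascade described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the proxy model has already produced a score in [0, 1] per document, that two thresholds `l` and `r` (names borrowed from Proposition 1) delimit the ambiguous band, and that `oracle` is a hypothetical stand-in for the LLM call. How the thresholds are chosen adaptively is not shown here.

```python
from typing import Callable, List, Sequence, Tuple


def cascade_filter(
    scores: Sequence[float],
    l: float,
    r: float,
    oracle: Callable[[int], bool],
) -> Tuple[List[bool], float]:
    """Two-threshold cascade (illustrative assumption, not the paper's code).

    Documents with score <= l are rejected, documents with score >= r are
    accepted, and only those strictly between l and r go to the oracle.
    Returns the per-document decisions and the data reduction rate, i.e.
    the fraction of documents that did not need an oracle call.
    """
    if not 0.0 <= l < r <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= l < r <= 1")

    decisions: List[bool] = []
    oracle_calls = 0
    for idx, score in enumerate(scores):
        if score <= l:
            decisions.append(False)        # confidently negative
        elif score >= r:
            decisions.append(True)         # confidently positive
        else:
            decisions.append(oracle(idx))  # ambiguous: defer to the LLM
            oracle_calls += 1

    reduction = 1.0 - oracle_calls / len(scores) if scores else 0.0
    return decisions, reduction


# Usage with made-up scores and a dummy oracle that accepts odd indices.
demo_scores = [0.05, 0.5, 0.95, 0.4]
demo_decisions, demo_reduction = cascade_filter(
    demo_scores, l=0.2, r=0.8, oracle=lambda i: i % 2 == 1
)
print(demo_decisions, demo_reduction)
```

In this toy run, two of the four documents fall in the ambiguous band, so the data reduction rate is 0.5. Widening the band raises accuracy at the cost of more oracle calls, which is the trade-off the adaptive cascade is meant to balance.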

Paper Structure

This paper contains 25 sections, 1 theorem, 15 equations, 15 figures, 4 tables, 2 algorithms.

Key Result

Proposition 1

Let $|S'| = pN$ be the sample size. For any thresholds $(l, r)$ with $0\le l < r \le 1$ and any $\delta>0$, there exists an $\epsilon > 0$ such that if the sample condition satisfies $\mathcal{T}_{S'}(l, r) \le (1-\alpha)F_{S'}^+ - \epsilon$, then the true accuracy satisfies:

Figures (15)

  • Figure 1: A detailed workflow of ScaleDoc -- ScaleDoc adapts pre-calculated semantic embeddings for query-specific online processing. The online process comprises a query-aware lightweight encoder and a subsequent cascade workflow.
  • Figure 2: Example score distributions of different proxies, with low and high data reduction rates.
  • Figure 3: Illustration of the objectives adopted in training ScaleDoc's Query-Aware Encoder.
  • Figure 4: End-to-end latencies and data reduction rate -- We evaluate ScaleDoc and other baselines with accuracy target $\alpha$ = 0.90. The data reduction rate measures the percentage of data that does not require the LLM oracle call, indicating the cost-saving.
  • Figure 5: Breakdown of different approaches on the PubMed dataset, measuring the average latency of each stage -- ScaleDoc (top) shows a significant improvement.
  • ...and 10 more figures

Theorems & Definitions (1)

  • Proposition 1