Table of Contents
Fetching ...

Xpose: Bi-directional Engineering for Hidden Query Extraction

Ahana Pradhan, Jayant Haritsa

TL;DR

This work tackles Hidden Query Extraction (HQE) by introducing Xpose, a bi-directional system that combines reverse engineering (XRE) and forward engineering (XFE) to recover complex hidden SQL queries. XRE deterministically builds a seed query by mutating and mutating-minimizing the input database, enabling unions, algebraic predicates, and disjunctions to surface in $Q_{\mathcal{S}}$, while XFE uses LLM-guided synthesis from the business description $TX_Q$ to refine this seed into a semantically equivalent final query $Q_{\mathcal{E}}$. The approach is validated on E-TPCH (TPCH extended with unions and diverse joins) and STACK benchmarks, showing that many industrial-strength HQE cases require the synergy of forward and reverse engineering to achieve accurate extraction. The results indicate that Xpose can extend HQE coverage beyond flat SPJGAOL, though the process remains computationally intensive, underscoring the need for powerful models and offline deployment scenarios. Overall, Xpose demonstrates a practical pathway to recover complex hidden queries with high confidence by leveraging both structured mutation-based clues and AI-assisted synthesis guided by business logic descriptions.

Abstract

Query reverse engineering (QRE) aims to synthesize a SQL query to connect a given database and result instance. A recent variation of QRE is where an additional input, an opaque executable containing a ground-truth query, is provided, and the goal is to non-invasively extract this specific query through only input-output examples. This variant, called Hidden Query Extraction (HQE), has a spectrum of industrial use-cases including query recovery, database security, and vendor migration. The reverse engineering (RE) tools developed for HQE, which are based on database mutation and generation techniques, can only extract flat queries with key-based equi joins and conjunctive arithmetic filter predicates, making them limited wrt both query structure and query operators. In this paper, we present Xpose, a HQE solution that elevates the extraction scope to realistic complex queries, such as those found in the TPCH benchmark. A two-pronged approach is taken: (1) The existing RE scope is substantially extended to incorporate union connectors, algebraic filter predicates, and disjunctions for both values and predicates. (2) The predictive power of LLMs is leveraged to convert business descriptions of the opaque application into extraction guidance, representing ``forward engineering" (FE). The FE module recognizes common constructs, such as nesting of sub-queries, outer joins, and scalar functions. In essence, FE establishes the broad query contours, while RE fleshes out the fine-grained details. We have evaluated Xpose on (a) E-TPCH, a query suite comprising the complete TPCH benchmark extended with queries featuring unions, diverse join types, and sub-queries; and (b) the real-world STACK benchmark. The experimental results demonstrate that its bi-directional engineering approach accurately extracts these complex queries, representing a significant step forward with regard to HQE coverage.

Xpose: Bi-directional Engineering for Hidden Query Extraction

TL;DR

This work tackles Hidden Query Extraction (HQE) by introducing Xpose, a bi-directional system that combines reverse engineering (XRE) and forward engineering (XFE) to recover complex hidden SQL queries. XRE deterministically builds a seed query by mutating and mutating-minimizing the input database, enabling unions, algebraic predicates, and disjunctions to surface in , while XFE uses LLM-guided synthesis from the business description to refine this seed into a semantically equivalent final query . The approach is validated on E-TPCH (TPCH extended with unions and diverse joins) and STACK benchmarks, showing that many industrial-strength HQE cases require the synergy of forward and reverse engineering to achieve accurate extraction. The results indicate that Xpose can extend HQE coverage beyond flat SPJGAOL, though the process remains computationally intensive, underscoring the need for powerful models and offline deployment scenarios. Overall, Xpose demonstrates a practical pathway to recover complex hidden queries with high confidence by leveraging both structured mutation-based clues and AI-assisted synthesis guided by business logic descriptions.

Abstract

Query reverse engineering (QRE) aims to synthesize a SQL query to connect a given database and result instance. A recent variation of QRE is where an additional input, an opaque executable containing a ground-truth query, is provided, and the goal is to non-invasively extract this specific query through only input-output examples. This variant, called Hidden Query Extraction (HQE), has a spectrum of industrial use-cases including query recovery, database security, and vendor migration. The reverse engineering (RE) tools developed for HQE, which are based on database mutation and generation techniques, can only extract flat queries with key-based equi joins and conjunctive arithmetic filter predicates, making them limited wrt both query structure and query operators. In this paper, we present Xpose, a HQE solution that elevates the extraction scope to realistic complex queries, such as those found in the TPCH benchmark. A two-pronged approach is taken: (1) The existing RE scope is substantially extended to incorporate union connectors, algebraic filter predicates, and disjunctions for both values and predicates. (2) The predictive power of LLMs is leveraged to convert business descriptions of the opaque application into extraction guidance, representing ``forward engineering" (FE). The FE module recognizes common constructs, such as nesting of sub-queries, outer joins, and scalar functions. In essence, FE establishes the broad query contours, while RE fleshes out the fine-grained details. We have evaluated Xpose on (a) E-TPCH, a query suite comprising the complete TPCH benchmark extended with queries featuring unions, diverse join types, and sub-queries; and (b) the real-world STACK benchmark. The experimental results demonstrate that its bi-directional engineering approach accurately extracts these complex queries, representing a significant step forward with regard to HQE coverage.

Paper Structure

This paper contains 52 sections, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: Linear Extraction Pipeline of Unmasqueunmasque_sigmodkapil_demo
  • Figure 2: Architecture of Xpose
  • Figure 3: Extracting TPCH Q20 using Bidirectional Engineering
  • Figure 4: Database Minimization
  • Figure 5: Extraction Scopes of XRE and XFE
  • ...and 9 more figures