Private Aggregate Queries to Untrusted Databases

Syed Mahbub Hafiz, Chitrabhanu Gupta, Warren Wnuck, Brijesh Vora, Chen-Nee Chuah

TL;DR

The paper tackles privately computing aggregates on untrusted databases, addressing the gap where traditional PIR supports item retrieval but not expressive aggregations. It introduces an information-theoretic PIR (IT-PIR) framework that uses standard aggregate vectors, index-based queries, and polynomial batch coding to enable private SUM, COUNT, MEAN, MIN, MAX, and histogram queries in a single round while maintaining $t$-privacy. The authors formalize the index-of-aggregate-queries mechanism, prove privacy under batching, and demonstrate practical viability with case studies on MIMIC-III, Twitter, and Yelp, achieving sub-second server processing on multi-million-row datasets using GPU acceleration and outperforming baselines such as Goldberg’s IT-PIR. They also discuss deployment considerations such as database updates, minimal configurations, and Byzantine robustness, and outline future directions including JOINs and extended query families, highlighting the method’s potential for privacy-preserving data analytics in outsourced database settings.

Abstract

Private information retrieval (PIR), a privacy-preserving cryptographic tool, solves a simplified version of the problem of privately querying an untrusted database by hiding which database item a client accesses. Most PIR protocols require the client to know the exact row index of the intended database item, so they cannot support complicated aggregation-based statistical queries in a similar setting. Some works in the PIR space support keyword searching and SQL-like queries, but most need multiple interactions between the PIR client and PIR servers. Some schemes support expressive SQL-like queries in a single round but fail to enable aggregate queries. These schemes are the main focus of this paper. To bridge the gap, we have built a novel general-purpose information-theoretic PIR (IT-PIR) framework that permits a user to fetch the aggregated result in a single round of interaction while hiding all sensitive sections of the complex query from the hosting PIR server. In other words, the server will not know which records contribute to the aggregation. We then evaluate the feasibility of our protocol for both benchmarking and real-world application settings. For instance, in a complex aggregate query to a Twitter microblogging database of 1 million tweets, our protocol takes 0.014 seconds for a PIR server to generate the result when the user is interested in one of 3K user handles. In contrast, for a much simpler task, a positional query rather than an aggregate, Goldberg's regular IT-PIR (Oakland 2007) takes 1.13 seconds. When all 300K possible user handles are considered, our protocol takes the same time as the regular IT-PIR. This example shows that complicated aggregate queries through our framework incur no additional overhead, and sometimes less, compared to the conventional query.
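
For readers unfamiliar with the positional queries mentioned above, the following is a minimal Python sketch of a Shamir-based, Goldberg-style positional IT-PIR lookup. It is an illustration under stated assumptions, not the authors' implementation: the prime modulus `P`, the toy database, the server evaluation points `1..n`, and all function names (`share_vector`, `server_answer`, `interpolate_at_zero`, `retrieve`) are hypothetical choices made for this example.

```python
import secrets

P = 2**31 - 1  # illustrative prime modulus (assumption, not from the paper)

def share_vector(vec, t, xs):
    """Shamir-share every entry of vec with a random degree-t polynomial.
    Returns one share vector per evaluation point in xs."""
    shares = [[] for _ in xs]
    for v in vec:
        coeffs = [v % P] + [secrets.randbelow(P) for _ in range(t)]
        for k, x in enumerate(xs):
            acc = 0
            for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
                acc = (acc * x + c) % P
            shares[k].append(acc)
    return shares

def server_answer(share, db):
    """Each server returns the vector-matrix product share * db over the field."""
    cols = len(db[0])
    return [sum(s * row[c] for s, row in zip(share, db)) % P for c in range(cols)]

def interpolate_at_zero(xs, ys):
    """Lagrange interpolation of the points (xs, ys), evaluated at x = 0."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def retrieve(db, index, t=1, n_servers=3):
    """Privately fetch row `index`; any t colluding servers see only random shares."""
    xs = list(range(1, n_servers + 1))
    basis = [1 if i == index else 0 for i in range(len(db))]
    answers = [server_answer(s, db) for s in share_vector(basis, t, xs)]
    return [interpolate_at_zero(xs, [a[c] for a in answers])
            for c in range(len(db[0]))]

toy_db = [[10, 20], [30, 40], [50, 60]]
row = retrieve(toy_db, 1)  # recovers the second row of toy_db
```

The client must already know that the wanted record sits at position 1, which is exactly the limitation the abstract points out. Reconstruction needs at least `t + 1` responses, since each response coordinate is a degree-`t` polynomial evaluated at that server's point.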

Paper Structure (52 sections, 5 theorems, 20 equations, 7 figures, 2 tables, 1 algorithm)


Key Result

Theorem 4.7

Fix $\mathpzc{u}>1$ and $j\in[0\,..\,\mathpzc{u}-1]$, and let $\Pi=\bigl(\Pi_1,\ldots,\Pi_{\ell}\bigr)\in\bigl(\mathbb{F}^{\mathpzc{p}\times\mathpzc{r}}\bigr)^{\ell}$ …

Figures (7)

  • Figure 1: Schematic diagram of the proposed IT-PIR protocol powered by indexes of aggregate queries. To demonstrate how our key protocol works, we highlight the example of a query to a hypothetical flight records database for finding the sum of ticket prices of connecting flights. To serve this query, the client first creates the query vector and then uses Shamir’s Secret Sharing algorithm to generate a secret share vector for each of the servers. Each secret share vector is sent to a server, where it is multiplied with the relevant index of aggregate queries matrix, in which each row corresponds to a different flight id. The resultant vector is then multiplied with the copy of the flight records database matrix on that server, and the product is returned to the client. The client then uses these server responses to reconstruct the aggregated response via Lagrange polynomial interpolation.
  • Figure 2: Query throughput with a varying number of rows in the database. The number of rows in the index of aggregate queries, the number of rows for aggregation, and the number of indexes of aggregate queries batched are kept constant.
  • Figure 3: Query throughput with a varying number of rows in the index of aggregate queries. The number of rows in the database, the number of rows for aggregation, and the number of indexes of aggregate queries batched are kept constant.
  • Figure 4: Query throughput with a varying number of rows for aggregation. The number of rows in the index of aggregate queries, the number of rows in the database, the number of indexes of aggregate queries batched, and the bit size of the modulus are kept constant.
  • Figure 5: Batching time with a varying number of indexes of aggregate queries batched. The number of rows in the index of aggregate queries, number of rows in the database, number of rows for aggregation, and bit size of modulus are kept constant. The NNZ percentages refer to the number of non-zero columns in the batched indexes of aggregate queries, and this percentage becomes higher as progressively larger numbers of indexes are batched.
  • ...and 2 more figures
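
The pipeline described in the Figure 1 caption can be sketched in a few lines of Python. This is a hedged toy example, not the authors' code: the modulus `P`, the five-record flight table, the 0/1 `index_matrix` (one row per flight id, a 1 wherever a database record belongs to that flight), and every function name are assumptions made for illustration. The constant-1 second column is an illustrative convenience for obtaining a COUNT alongside the SUM and is not taken from the paper.

```python
import secrets

P = 2**31 - 1  # illustrative prime modulus (assumption)

def shamir_share(vec, t, xs):
    """One share vector per server: each entry hidden by a random degree-t polynomial."""
    shares = [[] for _ in xs]
    for v in vec:
        coeffs = [v % P] + [secrets.randbelow(P) for _ in range(t)]
        for k, x in enumerate(xs):
            shares[k].append(sum(c * pow(x, d, P) for d, c in enumerate(coeffs)) % P)
    return shares

def vec_mat(v, m):
    """Vector-matrix product over the field."""
    return [sum(a * row[c] for a, row in zip(v, m)) % P for c in range(len(m[0]))]

def lagrange_at_zero(xs, ys):
    """Interpolate the points (xs, ys) and evaluate at x = 0."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def private_aggregate(db, index_matrix, j, t=1, n_servers=3):
    """Client asks for the aggregate in row j of the index without revealing j."""
    xs = list(range(1, n_servers + 1))
    query = [1 if k == j else 0 for k in range(len(index_matrix))]
    responses = []
    for share in shamir_share(query, t, xs):
        # Server side: (share * index matrix) selects records, then hits the database.
        selector = vec_mat(share, index_matrix)
        responses.append(vec_mat(selector, db))
    # Client side: interpolate each response column at 0.
    return [lagrange_at_zero(xs, [r[c] for r in responses])
            for c in range(len(db[0]))]

# Hypothetical flight records: [ticket_price, 1]
flights_db = [[120, 1], [80, 1], [200, 1], [95, 1], [60, 1]]
# Hypothetical index: flight 0 -> records 0,1; flight 1 -> record 2; flight 2 -> records 3,4
index_matrix = [[1, 1, 0, 0, 0],
                [0, 0, 1, 0, 0],
                [0, 0, 0, 1, 1]]

total_price, n_records = private_aggregate(flights_db, index_matrix, 2)
```

Because both server-side multiplications are linear, each response coordinate remains a degree-`t` polynomial in the server's evaluation point, so the same Lagrange interpolation used for a positional query recovers the aggregate in a single round.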

Theorems & Definitions (13)

  • Definition 4.1
  • Definition 4.3
  • Definition 4.6
  • Theorem 4.7
  • Definition 4.8
  • Definition 4.9
  • Definition B.1
  • Theorem B.2
  • proof
  • Theorem B.3
  • ...and 3 more