Differentially Private Substring and Document Counting

Giulia Bernardini, Philip Bille, Inge Li Gørtz, Teresa Anna Steiner

TL;DR

This paper studies differentially private pattern counting in collections of strings, focusing on two problems: Substring Count (how often a pattern occurs in the database) and Document Count (how many documents contain it). Here $n$ is the number of documents, $\ell$ the maximum document length, and $\Sigma$ the alphabet. The paper introduces a private data structure that answers all query patterns simultaneously with additive error $O(\ell\,\mathrm{polylog}(n\ell|\Sigma|))$ under pure $\epsilon$-DP, and improves the Document Count error to $O(\sqrt{\ell}\,\mathrm{polylog}(n\ell|\Sigma|))$ under $(\epsilon,\delta)$-DP, using $O(n\ell^2)$ space and $O(n^2\ell^4)$ preprocessing time. A central technical device is counting on trees via heavy-path decomposition: the algorithm prunes a candidate set of substrings, builds a trie, privately estimates the counts at the roots of the heavy paths, and privately aggregates along each heavy path using a generalized binary-tree mechanism. A matching lower bound shows that the pure-DP error is optimal up to a polylogarithmic factor. The work also yields improved algorithms for private frequent substring mining and $q$-gram extraction. Overall, the contribution is primarily theoretical: it initiates the study of private substring and document counting and gives data structures with polynomial preprocessing time and $O(|P|)$ query time for a pattern $P$.
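To make the tree device mentioned above more concrete, the following is a minimal sketch of a standard (non-private) heavy-path decomposition over a trie. It is not the paper's algorithm: it adds no noise and does no pruning, and the helper names `build_trie` and `heavy_paths` are hypothetical, chosen only for illustration. It shows how a trie splits into vertex-disjoint paths, each following the child with the largest subtree.

```python
def build_trie(words):
    """Return a trie as a list of dicts: children[v] maps a character to a child id.

    Node 0 is the root. A child is always created after its parent,
    so every child id is larger than its parent's id.
    """
    children = [{}]
    for w in words:
        v = 0
        for ch in w:
            if ch not in children[v]:
                children.append({})
                children[v][ch] = len(children) - 1
            v = children[v][ch]
    return children


def heavy_paths(children):
    """Split the trie into vertex-disjoint heavy paths (lists of node ids)."""
    n = len(children)
    # Subtree sizes, computed bottom-up. Reverse id order works because
    # child ids are always larger than parent ids (see build_trie).
    size = [1] * n
    for v in range(n - 1, -1, -1):
        for c in children[v].values():
            size[v] += size[c]

    paths = []
    stack = [0]  # nodes that start a new heavy path
    while stack:
        v = stack.pop()
        path = [v]
        while children[v]:
            kids = list(children[v].values())
            heavy = max(kids, key=lambda c: size[c])  # largest subtree
            # Every other child hangs off a light edge and starts its own path.
            stack.extend(c for c in kids if c != heavy)
            path.append(heavy)
            v = heavy
        paths.append(path)
    return paths


trie = build_trie(["ab", "ac", "abd"])
paths = heavy_paths(trie)
```

Each light edge leads to a subtree at most half the size of its parent's, so any root-to-node path crosses only logarithmically many heavy paths in the trie size. That property is what makes it plausible to handle the tree path by path.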

Abstract

Differential privacy is the gold standard for privacy in data analysis. In many data analysis applications, the data is a database of documents. For databases consisting of many documents, one of the most fundamental problems is that of pattern matching and computing (i) how often a pattern appears as a substring in the database (substring counting) and (ii) how many documents in the collection contain the pattern as a substring (document counting). In this paper, we initiate the theoretical study of substring and document counting under differential privacy. We give an $\epsilon$-differentially private data structure solving this problem for all patterns simultaneously with a maximum additive error of $O(\ell \cdot\mathrm{polylog}(n\ell|\Sigma|))$, where $\ell$ is the maximum length of a document in the database, $n$ is the number of documents, and $|\Sigma|$ is the size of the alphabet. We show that this is optimal up to an $O(\mathrm{polylog}(n\ell))$ factor. Further, we show that for $(\epsilon,\delta)$-differential privacy, the bound for document counting can be improved to $O(\sqrt{\ell} \cdot\mathrm{polylog}(n\ell|\Sigma|))$. Additionally, our data structures are efficient. In particular, our data structures use $O(n\ell^2)$ space, $O(n^2\ell^4)$ preprocessing time, and $O(|P|)$ query time where $P$ is the query pattern. Along the way, we develop a new technique for differentially privately computing a general class of counting functions on trees of independent interest. Our data structures immediately lead to improved algorithms for related problems, such as privately mining frequent substrings and $q$-grams. For $q$-grams, we further improve the preprocessing time of the data structure.
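The two counting problems defined in the abstract can be illustrated with a short sketch. The code below computes exact substring and document counts, then answers a single fixed document-count query with the textbook Laplace mechanism. This is only a baseline for one pattern, not the paper's data structure, which answers all patterns at once. It assumes non-empty patterns and that neighboring databases differ in one document, so a single document count has sensitivity 1. All function names are hypothetical.

```python
import random


def substring_count(docs, pattern):
    """Total number of occurrences of `pattern` over all documents (overlaps counted)."""
    total = 0
    for doc in docs:
        start = doc.find(pattern)
        while start != -1:
            total += 1
            start = doc.find(pattern, start + 1)
    return total


def document_count(docs, pattern):
    """Number of documents that contain `pattern` as a substring."""
    return sum(1 for doc in docs if pattern in doc)


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_document_count(docs, pattern, epsilon, rng):
    """Laplace mechanism for ONE fixed pattern.

    Assumption: changing one document changes the document count of a fixed
    pattern by at most 1, so noise of scale 1/epsilon suffices for this query.
    """
    return document_count(docs, pattern) + laplace_noise(1.0 / epsilon, rng)


docs = ["abab", "bab", "ccc"]
rng = random.Random(0)
exact_sub = substring_count(docs, "ab")   # occurrences across the database
exact_doc = document_count(docs, "ab")    # documents containing "ab"
private_doc = noisy_document_count(docs, "ab", epsilon=1.0, rng=rng)
```

A single document of length up to $\ell$ can contain up to $\ell$ occurrences of a pattern, so substring counts are more sensitive than document counts. Answering every pattern independently in this way would also require splitting the privacy budget over many queries, which is the difficulty the paper's tree-based technique is designed to address.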

Paper Structure

This paper contains 27 sections, 38 theorems, 26 equations.

Key Result

Theorem 1

Let $n$ and $\ell$ be integers and $\Sigma$ an alphabet of size $|\Sigma|$. Let $\Delta\leq \ell$. For any $\epsilon>0$ and $0<\beta<1$, there exists an $\epsilon$-differentially private algorithm, which can process any database $\mathcal{D}=S_1,\dots,S_{n}$ of documents in $\Sigma^{[1,\ell]}$ and w

Theorems & Definitions (63)

  • Definition 1: Differential Privacy
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Definition 2: $L_p$-sensitivity
  • Definition 3
  • Corollary 1
  • Lemma 1: Simple Composition (Dwork and Lei, STOC 2009)
  • ...and 53 more