Table of Contents
Fetching ...

Fast and Optimal Differentially Private Frequent-Substring Mining

Peaker Guo, Rayne Holland, Hao Wu

TL;DR

This work presents a new $\varepsilon$-differentially private algorithm that retains the same near-optimal error guarantees while reducing space complexity to $O(n \ell+ |\Sigma| )$ and time complexity to $O(n \ell\log |\Sigma| + |\Sigma| )$ for input alphabet $\Sigma$.

Abstract

Given a dataset of $n$ user-contributed strings, each of length at most $\ell$, a key problem is how to identify all frequent substrings while preserving each user's privacy. Recent work by Bernardini et al. (PODS'25) introduced a $\varepsilon$-differentially private algorithm achieving near-optimal error, but at the prohibitive cost of $O(n^2\ell^4)$ space and processing time. In this work, we present a new $\varepsilon$-differentially private algorithm that retains the same near-optimal error guarantees while reducing space complexity to $O(n \ell+ |Σ| )$ and time complexity to $O(n \ell\log |Σ| + |Σ| )$, for input alphabet $Σ$. Our approach builds on a top-down exploration of candidate substrings but introduces two new innovations: (i) a refined candidate-generation strategy that leverages the structural properties of frequent prefixes and suffixes, and (ii) pruning of the search space guided by frequency relations. These techniques eliminate the quadratic blow-ups inherent in prior work, enabling scalable frequent substring mining under differential privacy.

Fast and Optimal Differentially Private Frequent-Substring Mining

TL;DR

This work presents a new -differentially private algorithm that retains the same near-optimal error guarantees while reducing space complexity to and time complexity to for input alphabet .

Abstract

Given a dataset of user-contributed strings, each of length at most , a key problem is how to identify all frequent substrings while preserving each user's privacy. Recent work by Bernardini et al. (PODS'25) introduced a -differentially private algorithm achieving near-optimal error, but at the prohibitive cost of space and processing time. In this work, we present a new -differentially private algorithm that retains the same near-optimal error guarantees while reducing space complexity to and time complexity to , for input alphabet . Our approach builds on a top-down exploration of candidate substrings but introduces two new innovations: (i) a refined candidate-generation strategy that leverages the structural properties of frequent prefixes and suffixes, and (ii) pruning of the search space guided by frequency relations. These techniques eliminate the quadratic blow-ups inherent in prior work, enabling scalable frequent substring mining under differential privacy.
Paper Structure (23 sections, 16 theorems, 41 equations, 2 figures, 1 table, 2 algorithms)

This paper contains 23 sections, 16 theorems, 41 equations, 2 figures, 1 table, 2 algorithms.

Key Result

Theorem 1.1

Let $\mathcal{D}$ be a dataset of $n$ user-contributed strings over an alphabet $\Sigma$, each of length at most $\ell \in \mathbb{N}_+$. Let $\varepsilon > 0$ and $\beta \in (0,1)$. There exists an $\varepsilon$-differentially private algorithm that, with probability at least $1 - \beta$, outputs a

Figures (2)

  • Figure 1: Illustration of Lemma \ref{['lem:frequent_sub']}, with strings in $\mathcal{C}_{k}$ highlighted.
  • Figure 2: Example of a phase $i=2$ in \ref{['alg:DPFS']} with $\mathcal{D} = \{ \texttt{CGCA}, \texttt{CGCA}, \texttt{CATA}\}$, $\tau_{\bot}= 2$, and block encoding $E$ specified by $\texttt{A} \mapsto \texttt{00\$}$, $\texttt{C} \mapsto \texttt{01\$}$, $\texttt{G} \mapsto \texttt{10\$}$, and $\texttt{T} \mapsto \texttt{11\$}$. Here, $r = 3$ and $k = 6$. Left: $T_6$, the $r$-spaced sparse suffix tree of $\mathcal{C}_{6} = \{ \texttt{01\$00\$}, \texttt{01\$10\$}, \texttt{10\$01\$} \} \equiv \{ \texttt{CG}, \texttt{GC}, \texttt{CA} \}$ with offset $1$ and round $3$ (i.e., suffixes starting at positions 1 and 4). Right: the traversal on $s \circ T_6$ for each $s \in \mathcal{C}_{6}$. Each search begins at the root of the subtree $T_6$. At each node $u \in s \circ T_6$ encountered during the traversal, a noisy frequency count $\Tilde{f}= \tilde{f}_{\mathcal{D}}{( {\texttt{str}(u)\xspace} )}$ is computed (see Section \ref{['sec:noisy_counts']} for details). If $\Tilde{f}\geq \tau_{\bot}$, then $\texttt{str}(u)\xspace$ is added to $\mathcal{C}_{6+t}$, where $6+t=|\texttt{str}(u)\xspace| \in \{9, 12\}$ in this example. Red nodes denote substrings with noisy frequency $\geq$$\tau_{\bot}$, i.e., substrings in $\mathcal{C}_{9} = \{ \texttt{01\$10\$01\$}, \texttt{10\$01\$00\$} \} \equiv \{ \texttt{CGC}, \texttt{GCA} \}$ and $\mathcal{C}_{12} = \{ \texttt{01\$10\$01\$00\$} \} \equiv \{ \texttt{CGCA} \}$. Blue nodes (solid dots) and loci within edges (hollow dots) denote points where the search is pruned.

Theorems & Definitions (30)

  • Theorem 1.1: Informal Version of Theorem \ref{['thm:main']}
  • Definition 2.1: $\varepsilon$-Indistinguishability
  • Definition 2.2: $\varepsilon$-Differential Privacy DworkMNS06
  • Definition 2.3
  • Definition 2.4: Inclusion-Exclusion Criterion
  • Lemma 3.1: Composition DR14
  • Lemma 3.2: Tail Bound for Laplace Noise
  • Lemma 3.3: Laplace Mechanism DworkMNS06
  • Lemma 3.4: Binary Tree Mechanism ChanSS11DworkNPR10
  • Definition 3.5: $r$-Spaced Sparse Suffix Tree
  • ...and 20 more