Leakage-abuse Attack Against Substring-SSE with Partially Known Dataset
Xijie Ba, Qin Liu, Xiaohong Li, Jianting Ning
TL;DR
This paper addresses privacy risks in substring-SSE by presenting the first leakage-abuse attack that assumes a partially known dataset. It extends the LEAP framework with a matrix-based correlation approach that exploits the suffix-tree–based substring index to recover plaintext substrings, mapping encrypted tokens to alphabet symbols and substrings through iterative column and row mappings. Experiments on the Enron corpus show strong recovery: up to $97.87\%$ alphabet recovery and $98.32\%$ string recovery with $50\%$ auxiliary knowledge, and complete recovery at $60\%$ knowledge. The attack is also robust to dataset size (degradation $<5\%$ up to $30{,}000$ strings). These results reveal substantial privacy risks in current substring-SSE designs and underscore the urgent need for leakage-resilient constructions and defenses.
Abstract
Substring-searchable symmetric encryption (substring-SSE) has become increasingly critical for privacy-preserving applications in cloud systems. However, existing schemes leak information during search operations, and this leakage is especially dangerous when an adversary has partial knowledge of the target dataset. Although leakage-abuse attacks have been widely studied for traditional SSE, their applicability to substring-SSE under a partially known dataset remains unexplored. In this paper, we present the first leakage-abuse attack on substring-SSE in the partially known dataset setting. We develop a novel matrix-based correlation technique that extends and optimizes the LEAP framework for substring-SSE, enabling efficient recovery of plaintext data from encrypted suffix tree structures. Unlike existing approaches that rely on independent auxiliary datasets, our method directly exploits known data fragments to establish high-confidence mappings between ciphertext tokens and plaintext substrings through iterative matrix transformations. Comprehensive experiments on real-world datasets demonstrate the effectiveness of the attack, with substring recovery rates reaching 98.32% given 50% auxiliary knowledge. Even with only 10% prior knowledge, the attack achieves 74.42% substring recovery while scaling well across datasets of varying sizes. These results reveal significant privacy risks in current substring-SSE designs and highlight the urgent need for leakage-resilient constructions.
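To make the idea of correlating leaked access patterns with known plaintext concrete, the following toy Python sketch matches query tokens to substrings by their occurrence signature over the known strings. It is not the paper's algorithm. It makes the strong simplifying assumption that the attacker already knows which encrypted record identifiers correspond to the known strings, whereas a LEAP-style attack would have to infer that alignment. The names `recover_tokens`, `known`, and `leakage` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def substrings(s):
    """Return every non-empty substring of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def recover_tokens(known, leakage):
    """Map tokens to substrings via unique occurrence signatures.

    known:   dict mapping record id -> known plaintext string
             (ASSUMPTION: the id-to-plaintext alignment is already known).
    leakage: dict mapping query token -> set of record ids it matched.
    """
    known_ids = frozenset(known)

    # Signature of a candidate substring: the known records that contain it.
    plain_by_sig = defaultdict(list)
    candidates = set().union(*(substrings(s) for s in known.values()))
    for sub in candidates:
        sig = frozenset(i for i, s in known.items() if sub in s)
        plain_by_sig[sig].append(sub)

    # Signature of a token: its access pattern restricted to known records.
    token_by_sig = defaultdict(list)
    for tok, ids in leakage.items():
        sig = frozenset(ids) & known_ids
        if sig:
            token_by_sig[sig].append(tok)

    # Accept a mapping only when the signature is unique on both sides.
    mapping = {}
    for sig, toks in token_by_sig.items():
        subs = plain_by_sig.get(sig, [])
        if len(toks) == 1 and len(subs) == 1:
            mapping[toks[0]] = subs[0]
    return mapping


# Records 0 and 1 are known; record 2 is unknown to the attacker.
known = {0: "ab", 1: "bc"}
leakage = {"t1": {0, 1}, "t2": {1, 2}, "t3": {0}}
print(recover_tokens(known, leakage))  # {'t1': 'b'}
```

In this example only `t1` is recovered, because `b` is the only substring that occurs in both known strings. Tokens `t2` and `t3` stay unresolved since two candidate substrings share each of their signatures, which hints at why an iterative refinement over rows and columns is needed to disambiguate further.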
