Query Recovery from Easy to Hard: Jigsaw Attack against SSE

Hao Nie, Wei Wang, Peng Xu, Xianglong Zhang, Laurence T. Yang, Kaitai Liang

TL;DR

The paper introduces Jigsaw, a three-stage similar-data attack against searchable symmetric encryption (SSE) that exploits the distribution of keyword volume and frequency, together with co-occurrence information, to recover queries. It first locates distinctive queries, then refines the candidates via co-occurrence constraints, and finally recovers the remaining queries iteratively. It achieves about $90\%$ accuracy on three datasets, and roughly $60\%$ and $85\%$ accuracy against padding and obfuscation, respectively. The method is robust to frequency leakage decay and outperforms prior attacks in many scenarios, challenging existing defenses such as padding and obfuscation. These results underscore significant practical risks for SSE schemes and highlight the need for stronger leakage-control mechanisms, such as co-occurrence-aware defenses or stronger access-pattern protections.

Abstract

Searchable symmetric encryption schemes often unintentionally disclose certain sensitive information, such as access, volume, and search patterns. Attackers can exploit such leakages and other available knowledge related to the user's database to recover queries. We find that the effectiveness of query recovery attacks depends on the volume/frequency distribution of keywords. Queries containing keywords with high volumes/frequencies are more susceptible to recovery, even when countermeasures are implemented. Attackers can also effectively leverage these "special" queries to recover all others. By exploiting the above finding, we propose a Jigsaw attack that begins by accurately identifying and recovering those distinctive queries. Leveraging the volume, frequency, and co-occurrence information, our attack achieves $90\%$ accuracy in three tested datasets, which is comparable to previous attacks (Oya et al., USENIX '22 and Damie et al., USENIX '21). With the same runtime, our attack demonstrates an advantage over the attack proposed by Oya et al. (approximately $15\%$ higher accuracy when the keyword universe size is 15k). Furthermore, our proposed attack outperforms existing attacks against widely studied countermeasures, achieving roughly $60\%$ and $85\%$ accuracy against padding and obfuscation, respectively. In this context, with a large keyword universe ($\geq 3$k), it surpasses current state-of-the-art attacks by more than $20\%$.

Paper Structure

This paper contains 32 sections, 8 equations, 19 figures, 6 tables, and 3 algorithms.

Figures (19)

  • Figure 1: The distribution of queries on Enron. The horizontal dashed line separates the top $10\%$ of queries by volume from the rest; the vertical dashed line separates the top $10\%$ of queries by frequency from the rest. Blue dots denote the real queries issued by the user; red dots denote the queries successfully recovered by the simple attack presented in Appendix \ref{Appendix:simpleattack}.
  • Figure 2: The accuracy of Algorithm \ref{alg1} in the four quadrants for different $rv$ and $rf$, where keywords with the top-$rv\cdot l$ highest volumes are treated as high-volume keywords and keywords with the top-$rf\cdot l$ highest frequencies are treated as high-frequency keywords. A larger $rv$ means more queries are considered high-volume queries; similarly, a larger $rf$ means more queries are categorized as high-frequency queries.
  • Figure 3: The accuracy of Algorithm \ref{alg1} in the four quadrants for different $\alpha$, where $\alpha$ is the weight of volume and $(1-\alpha)$ is the weight of frequency in the measurement.
  • Figure 4: The accuracy of Jigsaw for different $\beta$, where $\beta$ is the weight of co-occurrence information and $(1-\beta)$ is the weight of volume and frequency information in calculating $score$. The left and right columns display the results with and without frequency information, respectively.
  • Figure 5: Accuracy and runtime comparisons on Enron, Lucene, and Wikipedia.
  • ...and 14 more figures
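
The quadrant split described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' code: the function name `split_quadrants`, the dictionary inputs, and the quadrant labels are hypothetical, and $l$ is assumed here to be the number of items being ranked, since the caption does not define it. Ties are broken arbitrarily by sort order.

```python
from typing import Dict, List


def split_quadrants(
    volume: Dict[str, float],
    frequency: Dict[str, float],
    rv: float,
    rf: float,
) -> Dict[str, List[str]]:
    """Partition items into four quadrants by volume and frequency rank.

    Items among the top rv*l by volume count as high-volume, and items
    among the top rf*l by frequency count as high-frequency, where l is
    ASSUMED to be the total number of items (not stated in the caption).
    """
    items = sorted(volume)  # fixed iteration order for reproducibility
    l = len(items)
    n_high_volume = int(rv * l)
    n_high_frequency = int(rf * l)

    # Rank in descending order and keep the top fraction of each ranking.
    top_volume = set(
        sorted(items, key=lambda k: volume[k], reverse=True)[:n_high_volume]
    )
    top_frequency = set(
        sorted(items, key=lambda k: frequency[k], reverse=True)[:n_high_frequency]
    )

    quadrants: Dict[str, List[str]] = {
        "high_vol_high_freq": [],
        "high_vol_low_freq": [],
        "low_vol_high_freq": [],
        "low_vol_low_freq": [],
    }
    for k in items:
        vol_part = "high" if k in top_volume else "low"
        freq_part = "high" if k in top_frequency else "low"
        quadrants[f"{vol_part}_vol_{freq_part}_freq"].append(k)
    return quadrants


if __name__ == "__main__":
    # Toy numbers invented purely for illustration.
    vol = {"a": 100, "b": 80, "c": 10, "d": 5}
    freq = {"a": 0.4, "b": 0.1, "c": 0.3, "d": 0.2}
    print(split_quadrants(vol, freq, rv=0.5, rf=0.5))
```

Raising `rv` or `rf` enlarges the corresponding "high" set, which matches the caption's remark that a larger ratio classifies more queries as high-volume or high-frequency.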