
SE#PCFG: Semantically Enhanced PCFG for Password Analysis and Cracking

Yangde Wang, Weidong Qiu, Peng Tang, Hao Tian, Shujun Li

TL;DR

This work tackles the gap in understanding semantic patterns in user-generated passwords across languages by introducing SE#PCFG, a semantically enhanced PCFG framework with 43 semantic factor types and multilingual coverage. It formalizes a four-level password model (Characters, SFs/SFTs, SPs, and Semantic Structure) and a streamlined pipeline for semantic analysis, including novel smoothing to handle unobserved patterns. Building on this, SEPCA is proposed as a semantically aware password-cracking architecture that outperforms three state-of-the-art baselines across 52 test cases, with significant improvements in user- and password-level coverage. The results yield new insights into cross-database semantic correlations and have practical implications for password policies, with robust methods for analyzing and auditing password security in multilingual settings.

Abstract

Much research has been done on user-generated textual passwords. Surprisingly, semantic information in such passwords remains under-investigated: existing studies mostly cover passwords created by English- and/or Chinese-speaking users and consider only limited semantics. This paper fills this gap by proposing a general framework based on semantically enhanced PCFG (probabilistic context-free grammars), named SE#PCFG. It allowed us to consider 43 types of semantic information, the richest set considered so far, for password analysis. Applying SE#PCFG to 17 large leaked password databases of users speaking four languages (English, Chinese, German and French), we demonstrate its usefulness and report a wide range of new insights about password semantics at different levels, such as cross-website password correlations. Furthermore, based on SE#PCFG and a new systematic smoothing method, we proposed the Semantically Enhanced Password Cracking Architecture (SEPCA) and compared its password coverage rate against three SOTA (state-of-the-art) benchmarks: two other PCFG variants and a neural network. Our experimental results showed that SEPCA outperformed all three benchmarks consistently and significantly across 52 test cases, by up to 21.53%, 52.55% and 7.86%, respectively, at the user level (with duplicate passwords). At the level of unique passwords, SEPCA also beat the three counterparts by up to 43.83%, 94.11% and 11.16%, respectively.
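The difference between the user-level and unique-password coverage rates can be illustrated with a minimal Python sketch. It assumes that coverage is the fraction of a test set matched by a list of guesses, counted either per account (duplicates included) or per distinct password. The function name `coverage_rates` and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter


def coverage_rates(test_passwords, guesses):
    """Return (user_level, unique_level) coverage of a test set.

    user_level   counts every account, so duplicate passwords weigh more.
    unique_level counts each distinct password once.
    Both definitions are assumptions made for illustration.
    """
    if not test_passwords:
        raise ValueError("test set must not be empty")
    guessed = set(guesses)
    counts = Counter(test_passwords)  # password -> number of accounts using it
    cracked_accounts = sum(n for pw, n in counts.items() if pw in guessed)
    cracked_unique = sum(1 for pw in counts if pw in guessed)
    return cracked_accounts / sum(counts.values()), cracked_unique / len(counts)


# Toy data (hypothetical): six accounts, four distinct passwords.
test_set = ["123456", "123456", "123456", "iloveyou", "qwerty1990", "Zx9!kp"]
guesses = ["123456", "password", "iloveyou"]
user_level, unique_level = coverage_rates(test_set, guesses)
print(f"user-level: {user_level:.2%}, unique-level: {unique_level:.2%}")
```

Because popular passwords are shared by many accounts, the same guess list usually scores higher at the user level than at the unique-password level, which is why the two sets of figures are reported separately.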

Paper Structure (26 sections, 4 equations, 6 figures, 10 tables)


Figures (6)

  • Figure 1: Distribution of combined SFTs in the 17 databases. The English, German and French databases clearly have similar distributions at the SFT level, except for database 10 (MyHeritage). The Chinese databases have distributions similar to each other but quite different from those of the other databases. All numbers labeled in each figure are averages.
  • Figure 2: Distributions of SPL in the 17 databases
  • Figure 3: Cross-database semantic correlation values at the SFT level and at the combined SF-SFT level, according to Han-DM-book2012 and Eq. \ref{eq:cos_similarity_SF-SFT}, respectively. The x- and y-axes show the indices of the 17 databases listed in Table \ref{tab:PasswordDatabases}.
  • Figure 4: Performance using Monte-Carlo (MC) estimation and real-attacks (RA).
  • Figure 5: Performance comparison between SEPCA and DPG (Pasquini-SP-2021) over all testing sets.
  • ...and 1 more figures
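
The caption of Figure 3 does not spell out the correlation measure, but the equation label suggests a cosine similarity. Under that assumption, the following minimal sketch compares the SFT frequency distributions of two databases. The SFT labels and frequencies are hypothetical placeholders, not values from the paper.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency vectors (dicts).

    Keys are SFT labels, values are relative frequencies.
    Labels missing from one vector are treated as frequency 0.
    """
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in set(p) | set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_p * norm_q)


# Hypothetical SFT distributions for two databases (illustrative only).
db_a = {"number": 0.40, "name": 0.35, "date": 0.25}
db_b = {"number": 0.70, "date": 0.20, "pinyin": 0.10}
print(f"SFT-level similarity: {cosine_similarity(db_a, db_b):.3f}")
```

Computing this value for every pair of the 17 databases would give a symmetric 17 x 17 matrix of the kind Figure 3 plots, with both axes indexed by database.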