SE#PCFG: Semantically Enhanced PCFG for Password Analysis and Cracking
Yangde Wang, Weidong Qiu, Peng Tang, Hao Tian, Shujun Li
TL;DR
This work addresses the limited understanding of semantic patterns in user-generated passwords across languages by introducing SE#PCFG, a semantically enhanced PCFG framework with 43 semantic factor types and multilingual coverage. It formalizes a four-level password model (Characters, SFs/SFTs, SPs, and Semantic Structure) and a streamlined pipeline for semantic analysis, including a novel smoothing method to handle unobserved patterns. Building on this, SEPCA is proposed as a semantically aware password-cracking architecture that outperforms three state-of-the-art baselines across 52 test cases, with significant improvements in coverage at both the user level and the unique-password level. The results yield new insights into cross-database semantic correlations, carry practical implications for password policies, and provide methods for analyzing and auditing password security in multilingual settings.
Abstract
Much research has been done on user-generated textual passwords. Surprisingly, semantic information in such passwords remains under-investigated: existing studies focus mostly on passwords created by English- and/or Chinese-speaking users and consider only limited semantics. This paper fills this gap by proposing a general framework based on semantically enhanced PCFG (probabilistic context-free grammars), named SE#PCFG. It allows us to consider 43 types of semantic information, the richest set considered so far, for password analysis. Applying SE#PCFG to 17 large leaked password databases of users speaking four languages (English, Chinese, German and French), we demonstrate its usefulness and report a wide range of new insights about password semantics at different levels, such as cross-website password correlations. Furthermore, based on SE#PCFG and a new systematic smoothing method, we propose the Semantically Enhanced Password Cracking Architecture (SEPCA) and compare its performance against three SOTA (state-of-the-art) benchmarks in terms of the password coverage rate: two other PCFG variants and a neural network. Our experimental results show that SEPCA outperforms all three benchmarks consistently and significantly across 52 test cases, by up to 21.53%, 52.55% and 7.86%, respectively, at the user level (with duplicate passwords). At the level of unique passwords, SEPCA also beats the three counterparts by up to 43.83%, 94.11% and 11.16%, respectively.
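The two coverage levels reported above differ only in how duplicates in the test set are counted. The following minimal Python sketch illustrates the distinction; it assumes that the coverage rate is the fraction of test passwords matched by a model's guess list, which the abstract does not define precisely. The function name `coverage_rates` and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter


def coverage_rates(guesses, test_passwords):
    """Return (user_level, unique_level) coverage of a guess list.

    Assumed definitions (illustrative, not taken from the paper):
      user_level   = cracked accounts / all accounts (duplicates counted)
      unique_level = cracked distinct passwords / all distinct passwords
    """
    if not test_passwords:
        raise ValueError("test_passwords must not be empty")

    guessed = set(guesses)
    counts = Counter(test_passwords)  # password -> number of accounts using it

    total_accounts = sum(counts.values())
    hit_accounts = sum(n for pw, n in counts.items() if pw in guessed)
    hit_unique = sum(1 for pw in counts if pw in guessed)

    return hit_accounts / total_accounts, hit_unique / len(counts)


# Toy test set: six accounts, four distinct passwords (hypothetical data).
test_set = ["123456", "123456", "123456", "iloveyou", "Berlin2019", "qwerty"]
guesses = ["123456", "qwerty", "password"]

user_level, unique_level = coverage_rates(guesses, test_set)
print(f"user-level: {user_level:.2%}, unique-level: {unique_level:.2%}")
```

In this toy case the user-level rate is higher than the unique-level rate because one popular password accounts for half of the accounts. This is why the two levels are reported separately: frequently reused passwords dominate the user-level figure, while the unique-password figure weights every distinct password equally.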
