Lost in Disclosure: On The Inference of Password Composition Policies
Saul Johnson, João Ferreira, Alexandra Mendes, Julien Cordry
TL;DR
Lost in Disclosure presents a practical approach for inferring password composition policies from breached password datasets that lack explicit policy labels. It reframes policy inference as outlier detection over password attributes such as $length$ and $digits$, operationalized via $mult(l)=\frac{cum(l+1)}{cum(l)}$ with a threshold $c=2$, and implemented in the pol-infer tool. Across real datasets (RockYou, Yahoo, 000webhost, LinkedIn) and synthetic tests (padding and formatting errors), the method recovers known minimum-length constraints and detects the absence of certain requirements, corroborating prior literature while illustrating robustness to noise. This work enables safer, policy-agnostic use of password data for research and highlights practical steps for handling noisy disclosures in security datasets.
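As a rough illustration of the ratio above, the following Python sketch computes $mult(l)$ over password lengths and lists the lengths at which it exceeds $c$. It is not the pol-infer implementation. It assumes that $cum(l)$ denotes the number of passwords of length at most $l$, and that a length is flagged when $mult(l)$ is strictly greater than $c$; neither detail is stated in this summary. The function names `cumulative_length_counts` and `flag_lengths` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def cumulative_length_counts(passwords: Iterable[str]) -> Dict[int, int]:
    """ASSUMPTION: cum(l) = number of passwords with length <= l."""
    counts = Counter(len(p) for p in passwords)
    cum: Dict[int, int] = {}
    if not counts:
        return cum
    running = 0
    for length in range(0, max(counts) + 1):
        running += counts.get(length, 0)
        cum[length] = running
    return cum


def flag_lengths(passwords: Iterable[str], c: float = 2.0) -> List[Tuple[int, float]]:
    """Return (l, mult(l)) for every l where mult(l) = cum(l+1)/cum(l) > c.

    ASSUMPTION: the comparison is strict. Lengths with cum(l) == 0 are
    skipped because the ratio is undefined there.
    """
    cum = cumulative_length_counts(passwords)
    flagged: List[Tuple[int, float]] = []
    for length in sorted(cum):
        if length + 1 not in cum or cum[length] == 0:
            continue
        mult = cum[length + 1] / cum[length]
        if mult > c:
            flagged.append((length, mult))
    return flagged


# Toy data: two short "noise" passwords, the rest of length 6 or more.
sample = ["abc"] * 2 + ["abcdef"] * 50 + ["abcdefg"] * 30 + ["abcdefgh"] * 20
flagged = flag_lengths(sample)
```

On this toy data the only flagged length is $l=5$, where the cumulative count jumps from 2 to 52. That jump is consistent with a minimum-length requirement of 6 despite the two shorter noise entries. How a single policy value is chosen when several lengths are flagged is left open here, since the summary does not specify it.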
Abstract
Large-scale password data breaches are becoming increasingly commonplace. This has enabled researchers to produce a substantial body of password security research utilising real-world password datasets, which often contain tens or even hundreds of millions of records. While much study has been conducted on how password composition policies (sets of rules that a user must abide by when creating a password) influence the distribution of user-chosen passwords on a system, much less research has been done on inferring the password composition policy that a given set of user-chosen passwords was created under. In this paper, we state the problem with the naive approach to this challenge, and suggest a simple approach that produces more reliable results. We also present pol-infer, a tool that implements this approach, and demonstrate its use in inferring password composition policies.
