MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark
William Corrias, Fabio De Gaspari, Dorjan Hitaj, Luigi V. Mancini
TL;DR
MAYA addresses the lack of standardized evaluation in generative password guessing with a modular benchmarking framework that unifies data processing, model implementation, and advanced testing scenarios across eight real-world datasets. It re-implements eight state-of-the-art models (six DL-based and two ML-based) in a PyTorch-based, plug-and-play environment, enabling fair comparisons and a broad characterization of model behavior under diverse conditions, including cross-community and cross-cultural generalization. The study reports three main findings: autoregressive DL models excel in in-distribution settings, traditional ML methods remain competitive on some tasks, and multi-model attacks improve coverage by leveraging the complementary distributions learned by different models. It also shows that longer and more complex passwords remain challenging and that model generalization can persist across distribution shifts. By releasing MAYA publicly, the authors give the password-security community a rigorous, reproducible benchmark that supports both improved defense mechanisms (e.g., honeywords, strength meters) and future research into more robust password-generation and detection methods.
Abstract
Recent advances in generative models have led to their application in password guessing, with the aim of replicating the complexity, structure, and patterns of human-created passwords. Despite their potential, inconsistencies and inadequate evaluation methodologies in prior research have hindered meaningful comparisons and a comprehensive, unbiased understanding of their capabilities. This paper introduces MAYA, a unified, customizable, plug-and-play benchmarking framework designed to facilitate the systematic characterization and benchmarking of generative password-guessing models in the context of trawling attacks. Using MAYA, we conduct a comprehensive assessment of six state-of-the-art approaches, which we re-implemented and adapted to ensure standardization. Our evaluation spans eight real-world password datasets and covers an exhaustive set of advanced testing scenarios, totaling over 15,000 compute hours. Our findings indicate that these models effectively capture different aspects of human password distributions and exhibit strong generalization capabilities. However, their effectiveness varies significantly with long and complex passwords. Across our evaluation, sequential models consistently outperform other generative architectures and traditional password-guessing tools, demonstrating unique capabilities in generating accurate and complex guesses. Moreover, the diverse password distributions learned by the models enable a multi-model attack that outperforms the best individual model. By releasing MAYA, we aim to foster further research, providing the community with a new tool to consistently and reliably benchmark generative password-guessing models. Our framework is publicly available at https://github.com/williamcorrias/MAYA-Password-Benchmarking.
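
To make the multi-model idea concrete, the following is a minimal sketch of how guesses from several models might be merged and scored under a shared guess budget. It is not taken from MAYA: the round-robin merging strategy, the function names `match_rate` and `combine_round_robin`, and the toy password lists are all assumptions chosen for illustration, and the actual combination scheme used in the paper may differ.

```python
from itertools import zip_longest


def match_rate(guesses, test_set):
    """Fraction of distinct test passwords that appear among the guesses."""
    test = set(test_set)
    if not test:
        return 0.0
    return len(test & set(guesses)) / len(test)


def combine_round_robin(guess_lists, budget):
    """Interleave per-model guess lists, dropping duplicates, up to `budget` guesses.

    Hypothetical merging strategy: take the next guess from each model in turn.
    """
    seen = set()
    combined = []
    for tier in zip_longest(*guess_lists):
        for guess in tier:
            # None marks a model whose guess list is already exhausted.
            if guess is None or guess in seen:
                continue
            seen.add(guess)
            combined.append(guess)
            if len(combined) == budget:
                return combined
    return combined


# Toy data (invented): ranked guesses from two hypothetical models.
model_a = ["123456", "password", "iloveyou", "qwerty"]
model_b = ["123456", "Summer!9", "letmein", "dragon2024"]
test_set = ["password", "Summer!9", "qwerty", "zX9#kLm2"]

budget = 3  # every attack gets the same number of guesses
rate_a = match_rate(model_a[:budget], test_set)
rate_b = match_rate(model_b[:budget], test_set)
combined = combine_round_robin([model_a, model_b], budget)
rate_combined = match_rate(combined, test_set)

print(f"model A: {rate_a:.2f}, model B: {rate_b:.2f}, combined: {rate_combined:.2f}")
```

On this toy input the merged list covers more of the test set than either model alone at the same budget, because the two lists hit different test passwords. This only illustrates the mechanism of complementary guess distributions and says nothing about the magnitude of the effect reported in the paper.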
