
Act as a Honeytoken Generator! An Investigation into Honeytoken Generation with Large Language Models

Daniel Reti, Norman Becker, Tillmann Angeli, Anasuya Chattopadhyay, Daniel Schneider, Sebastian Vollmer, Hans D. Schotten

TL;DR

The paper investigates scalable honeytoken generation using large language models (LLMs) to automate the creation of diverse deceptive data. It introduces a modular prompt-building framework and generates seven honeytoken types, using two of them (robots.txt files and honeywords) to evaluate 210 prompt structures across GPT-3.5, GPT-4, LLaMA2, and Gemini. Prompts that perform best on one model do not necessarily generalize to another. Honeywords generated by GPT-3.5 were harder to distinguish from real passwords than those produced by prior automated methods (approximately 14–15% attacker success versus higher baselines). Overall, the work demonstrates the feasibility of generic, LLM-driven honeytoken generation and highlights cross-model variability, practical benefits, and limitations that can guide future research in cyber deception and defense tooling.

Abstract

With the increasing prevalence of security incidents, the adoption of deception-based defense strategies has become pivotal in cyber security. This work addresses the challenge of scalability in designing honeytokens, a key component of such defense mechanisms. The manual creation of honeytokens is a tedious task. Although automated generators exist, they often lack versatility, being specialized for specific types of honeytokens, and rely heavily on suitable training datasets. To overcome these limitations, this work systematically investigates the use of Large Language Models (LLMs) to create a variety of honeytokens. Of the seven honeytoken types created in this work, which include configuration files, databases, and log files, two were used to evaluate the optimal prompt. The generation of robots.txt files and honeywords was used to systematically test 210 different prompt structures, based on 16 prompt building blocks. Furthermore, all honeytokens were tested across different state-of-the-art LLMs to assess how performance varies between models. Prompts that perform optimally on one LLM do not necessarily generalize well to another. Honeywords generated by GPT-3.5 were found to be less distinguishable from real passwords than those produced by previous methods of automated honeyword generation. Overall, the findings of this work demonstrate that generic LLMs are capable of creating a wide array of honeytokens using the presented prompt structures.
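
The idea of assembling prompts from building blocks can be illustrated with a minimal sketch. The block names and texts below are hypothetical placeholders, not the 16 blocks used in the paper, and the enumeration rule (one required block plus any subset of optional ones) is an assumption made for illustration; it does not reproduce how the 210 prompt structures were derived.

```python
from itertools import combinations

# Hypothetical building blocks (assumption); the paper's actual 16 blocks
# are not listed in this section.
BLOCKS = {
    "role": "Act as a honeytoken generator.",
    "task": "Generate a robots.txt file for a company website.",
    "goal": "The entries should lure attackers toward monitored paths.",
    "format": "Output only the file content, without explanations.",
}

# Blocks that must appear in every prompt (assumption for this sketch).
REQUIRED = ("task",)


def build_prompts(blocks, required=REQUIRED):
    """Return (block_names, prompt_text) for the required blocks plus
    every subset of the optional blocks."""
    optional = [name for name in blocks if name not in required]
    prompts = []
    for size in range(len(optional) + 1):
        for chosen in combinations(optional, size):
            # Keep the dictionary order so prompts stay comparable.
            names = [n for n in blocks if n in required or n in chosen]
            text = " ".join(blocks[n] for n in names)
            prompts.append((tuple(names), text))
    return prompts


if __name__ == "__main__":
    for names, text in build_prompts(BLOCKS):
        print(names, "->", text)
```

Each generated variant could then be sent to a model and scored, which makes it possible to estimate how the presence of a single block changes the result.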

Paper Structure

This paper contains 17 sections, 1 equation, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Analysis of change of score if one parameter is present
  • Figure 2: Hit rate of real password detection algorithm depending on maximal allowed login attempts per user and maximal total login attempts. Each color represents a different size of the training set. 1000 examples were presented, each example consisting of 1 real password and 19 honeywords generated with ChatGPT.
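
For orientation, the setup described in the Figure 2 caption (1 real password hidden among 19 honeywords) can be compared against a naive attacker who guesses uniformly at random. The sketch below is only such a baseline under stated assumptions; it is not the detection algorithm evaluated in the paper, it ignores the limit on total login attempts, and the function names are hypothetical.

```python
import random


def random_guess_hit_rate(list_size: int, attempts_per_user: int) -> float:
    """Exact probability that a uniformly guessing attacker picks the real
    password within the allowed attempts (guesses without repetition)."""
    return min(attempts_per_user, list_size) / list_size


def simulate_hit_rate(list_size: int, attempts_per_user: int,
                      users: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity over many user accounts."""
    rng = random.Random(seed)
    tries = min(attempts_per_user, list_size)
    hits = 0
    for _ in range(users):
        real_index = rng.randrange(list_size)          # position of the real password
        guesses = rng.sample(range(list_size), tries)  # distinct random guesses
        hits += real_index in guesses
    return hits / users


if __name__ == "__main__":
    # 20 candidates per user: 1 real password plus 19 honeywords.
    for attempts in (1, 2, 3):
        exact = random_guess_hit_rate(20, attempts)
        estimate = simulate_hit_rate(20, attempts, users=1000)
        print(attempts, exact, estimate)
```

With 20 candidates and a single attempt per user, the random baseline is 1/20 = 5%; a detection algorithm is only informative to the extent that its hit rate exceeds this value for the same number of attempts.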