
Honeyfile Camouflage: Hiding Fake Files in Plain Sight

Roelien C. Timmer, David Liebowitz, Surya Nepal, Salil S. Kanhere

TL;DR

This paper tackles the problem of camouflaging honeyfile filenames by embedding them in semantic vector spaces and using two cosine-distance-based camouflage metrics. It introduces Simple Camouflage, based on distance to the directory mean, and Cluster Camouflage, based on a von Mises-Fisher mixture model, with performance evaluated on a GitHub filesystem dataset. The results show both metrics effectively distinguish locally sourced filenames from external samples, with Simple Camouflage offering substantially lower computational cost and comparable effectiveness. The work advances practical cyber deception by providing quantitative tools to generate believable yet stealthy honeyfile names and discusses implications for deployment environments and future testing on diverse datasets and LLM-generated content.

Abstract

Honeyfiles are a particularly useful type of honeypot: fake files deployed to detect and infer information from malicious behaviour. This paper considers the challenge of naming honeyfiles so they are camouflaged when placed amongst real files in a file system. Based on cosine distances in semantic vector spaces, we develop two metrics for filename camouflage: one based on simple averaging and one on clustering with mixture fitting. We evaluate and compare the metrics, showing that both perform well on a publicly available GitHub software repository dataset.
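The averaging-based metric can be illustrated with a minimal sketch: embed every filename in a directory, average the embeddings, and score a candidate honeyfile name by its cosine distance to that mean. The summary above does not specify the embedding model or any score normalisation, so `toy_embed` (hashed character trigrams) and `simple_camouflage_distance` are hypothetical stand-ins for illustration, not the authors' implementation.

```python
import hashlib

import numpy as np


def toy_embed(name: str, dim: int = 256) -> np.ndarray:
    """Hypothetical stand-in for a semantic embedding: hashed character trigrams.

    Assumption: a real deployment would use a learned semantic embedding instead.
    """
    vec = np.zeros(dim)
    padded = f"  {name.lower()}  "
    for i in range(len(padded) - 2):
        digest = hashlib.md5(padded[i:i + 3].encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def simple_camouflage_distance(candidate, directory_names, embed=toy_embed):
    """Cosine distance between the candidate's embedding and the directory mean.

    Lower values mean the candidate name sits closer to the directory's
    average filename, i.e. it is better camouflaged under this sketch.
    """
    mean = np.mean([embed(n) for n in directory_names], axis=0)
    cand = embed(candidate)
    denom = np.linalg.norm(cand) * np.linalg.norm(mean)
    if denom == 0:
        return 1.0  # no usable direction: treat as maximally distant
    return 1.0 - float(np.dot(cand, mean) / denom)


directory = ["test_models.py", "test_views.py", "test_forms.py"]
for name in ["test_utils.py", "quarterly_report.xlsx"]:
    print(name, round(simple_camouflage_distance(name, directory), 3))
```

Under this toy embedding, a name that follows the directory's naming pattern should score a smaller distance than an unrelated name. The appeal of the approach is cost: one mean per directory and one dot product per candidate.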

Paper Structure (18 sections, 6 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: On the left, the black point is poorly camouflaged even though it lies at the centre of all the points. On the right, the black point is well camouflaged because it sits within a cluster.
  • Figure 2: Distribution of the simple camouflage score for local and sampled files, after normalising the scores per directory.
  • Figure 3: Distribution of the cluster camouflage score for local and sampled files, after normalising the scores per directory.
  • Figure 4: Log histogram of number of items per directory.
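
The situation in Figure 1 can be reproduced with a small numerical sketch. Two tight groups of unit vectors stand in for filename embeddings; a point at the centre of all points is closest to the overall mean yet far from every actual point, while a point inside one group is the reverse. The nearest-neighbour cosine distance used here is only a simple proxy for cluster membership, chosen as an assumption for illustration; it is not the von Mises-Fisher mixture model the paper uses for Cluster Camouflage.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def unit(angle_deg):
    """Unit vector in 2-D at the given angle (degrees)."""
    t = np.deg2rad(angle_deg)
    return np.array([np.cos(t), np.sin(t)])


# Two tight groups of "filename embeddings" around 0 and 90 degrees.
points = np.array([unit(a) for a in (-4, -2, 0, 2, 4, 86, 88, 90, 92, 94)])
mean = points.mean(axis=0)  # points along 45 degrees, between the groups

centre_point = unit(45)   # at the centre of all points, in no group
cluster_point = unit(1)   # inside the first group


def nearest_distance(p, pts):
    """Proxy for cluster membership: distance to the closest existing point."""
    return min(cosine_distance(p, q) for q in pts)


for label, p in [("centre", centre_point), ("cluster", cluster_point)]:
    print(label,
          "to mean:", round(cosine_distance(p, mean), 4),
          "to nearest:", round(nearest_distance(p, points), 4))
```

A mean-based score alone would rank the centre point as the better hiding place, which is the failure mode a cluster-aware metric is meant to address.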