Contextual Chart Generation for Cyber Deception

David D. Nguyen, David Liebowitz, Surya Nepal, Salil S. Kanhere, Sharif Abuadbba

TL;DR

This work tackles the realism gap in honeyfiles by focusing on document-embedded charts. It proposes HoneyPlotNet, a unified architecture that combines a multimodal multitask Transformer with a specialized Plot Data Model based on a multi-head VQVAE to generate coherent captions and chart data conditioned on long document text. A new document-chart dataset (5418 pairs) and a novel Keyword Semantic Matching (KSM) metric are released, enabling robust benchmarking against large language models. Experiments show that HoneyPlotNet achieves superior semantic alignment and data realism (notably with HPN-T5) compared to baselines like ChatGPT and GPT-4, indicating practical potential for scalable, realistic honeyplot generation in cyber deception. These contributions advance defensive deception capabilities by delivering scalable chart generation aligned with document context and provide open resources for further research.

Abstract

Honeyfiles are security assets designed to attract and detect intruders on compromised systems. A honeyfile is a type of honeypot that mimics a real, sensitive document, creating the illusion that valuable data is present. Interaction with a honeyfile reveals the presence of an intruder and can provide insights into their goals and intentions. Their practical use, however, is limited by the time, cost and effort associated with manually creating realistic content. The introduction of large language models has made high-quality text generation accessible, but honeyfiles contain a variety of content including charts, tables and images. To successfully deceive an intruder, this content needs to be plausible and realistic, as well as semantically consistent both within honeyfiles and with the real documents they mimic. In this paper, we focus on an important component of the honeyfile content generation problem: document charts. Charts are ubiquitous in corporate documents and are commonly used to communicate quantitative and scientific data. Existing image generation models, such as DALL-E, are prone to generating charts with incomprehensible text and unconvincing data. We take a multi-modal approach to this problem by combining two purpose-built generative models: a multitask Transformer and a specialized multi-head autoencoder. The Transformer generates realistic captions and plot text, while the autoencoder generates the underlying tabular data for the plot. To advance the field of automated honeyplot generation, we also release a new document-chart dataset and propose a novel metric, Keyword Semantic Matching (KSM). This metric measures the semantic consistency between keywords of a corpus and a smaller bag of words. Extensive experiments demonstrate excellent performance against multiple large language models, including ChatGPT and GPT-4.

Paper Structure (41 sections, 2 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Honeyplots generated by different image and language models using the same prompt/caption: "Age and gender difference of disability, institutionalization and death (n = 1560)".
  • Figure 2: Architecture overview of the HoneyPlotNet. The document text is fed into a Transformer language model to generate captions (dotted lines). The captions are fed back to generate chart text and data tokens (solid lines). Tokens are passed into the Plot Data Model, which generates the continuous chart data.
  • Figure 3: Overview of the Plot Data Model, which combines the VQ framework with a multi-head encoder and decoder. This model is responsible for generating continuous data values for multiple chart types. See the encoder and decoder sections of the paper for details.
  • Figure 4: Preprocessing and multi-task learning framework.
  • Figure 5: A comparison of original chart data to model reconstruction. GPT-4 struggles to determine the correct data range, which leads to poor FID scores.
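
The captions for Figures 2 and 3 describe data tokens that the Plot Data Model turns back into continuous chart data through a VQ framework. The generic vector-quantization step behind that idea can be sketched as below. This is a minimal illustration of nearest-codebook lookup as used in VQ-VAE-style models in general, not the paper's implementation: the function names (`quantize`, `decode_tokens`), the toy codebook, and the toy latent vectors are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each continuous latent vector to its nearest codebook entry.

    Returns the discrete token indices and the quantized vectors.
    """
    tokens: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        # Pick the index of the codebook vector closest to z.
        index = min(
            range(len(codebook)),
            key=lambda k: squared_distance(z, codebook[k]),
        )
        tokens.append(index)
        quantized.append(list(codebook[index]))
    return tokens, quantized


def decode_tokens(
    tokens: Sequence[int],
    codebook: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Look up codebook vectors for a token sequence (the decoder's input)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy values, chosen only to make the lookup easy to follow.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.7]]

tokens, quantized = quantize(latents, codebook)
print(tokens)     # [0, 1, 2]
print(quantized)  # [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
```

In a setup like the one the captions describe, a language model would predict token indices such as `tokens`, and a decoder would map the looked-up vectors from `decode_tokens` to continuous chart values. How the paper's multi-head encoder and decoder handle different chart types is not covered by this sketch.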