
CIPHER: Cybersecurity Intelligent Penetration-testing Helper for Ethical Researcher

Derry Pratama, Naufal Suryanto, Andro Aprila Adiputra, Thi-Thu-Huong Le, Ahmada Yusril Kadiptya, Muhammad Iqbal, Howon Kim

TL;DR

CIPHER is a large language model specifically trained to assist in penetration testing tasks as a chatbot. Its accompanying FARR Flow benchmark fills a significant gap in traditional cybersecurity Q&A benchmarks and provides a realistic and rigorous standard for evaluating LLMs' technical knowledge, reasoning capabilities, and practical utility in dynamic penetration testing scenarios.

Abstract

Penetration testing, a critical component of cybersecurity, typically requires extensive time and effort to find vulnerabilities. Beginners in this field often benefit from collaborative approaches with the community or experts. To address this, we develop CIPHER (Cybersecurity Intelligent Penetration-testing Helper for Ethical Researchers), a large language model specifically trained to assist in penetration testing tasks. We trained CIPHER using over 300 high-quality write-ups of vulnerable machines, hacking techniques, and documentation of open-source penetration testing tools. Additionally, we introduced the Findings, Action, Reasoning, and Results (FARR) Flow augmentation, a novel method to augment penetration testing write-ups to establish a fully automated pentesting simulation benchmark tailored for large language models. This approach fills a significant gap in traditional cybersecurity Q&A benchmarks and provides a realistic and rigorous standard for evaluating AI's technical knowledge, reasoning capabilities, and practical utility in dynamic penetration testing scenarios. In our assessments, CIPHER achieved the best overall performance in providing accurate suggestion responses compared to other open-source penetration testing models of similar size and even larger state-of-the-art models like Llama 3 70B and Qwen1.5 72B Chat, particularly on insane difficulty machine setups. This demonstrates that the current capabilities of general LLMs are insufficient for effectively guiding users through the penetration testing process. We also discuss the potential for improvement through scaling and the development of better benchmarks using FARR Flow augmentation results. Our benchmark will be released publicly at https://github.com/ibndias/CIPHER.

Paper Structure (42 sections, 27 figures, 10 tables, 1 algorithm)


Figures (27)

  • Figure S1: CIPHER development methodology overview for training and novel automated evaluation.
  • Figure S2: The architecture of CIPHER deployed as a chat assistant. The user submits a query and known information (1), which is converted into text embeddings (2). FAISS VectorDB performs cosine similarity matching with the knowledge database (3). The reranker filters and reorders results based on relevance (4). The top-ranked response is then used to build the reference (5), leading to the generation of the best suggestion for penetration steps (6). The user executes this suggestion on the attacker machine (7).
  • Figure S3: Each machine write-up is chunked into smaller pieces to extract multiple conversations, concatenated into a complete penetration testing session dialogue.
  • Figure S4: Synthetic conversation between a Newbie Pentester (NP) and an Expert Pentester (EP) generated from a real write-up chunk. The expert's experience is reflected in the specific solution, demonstrating knowledge typically gained through practice rather than textbook learning.
  • Figure S5: Two CIPHER training dataset structures: context type and conversation type. Note that '#' denotes a header in Markdown format, while the other structure is in ChatML format.
  • ...and 22 more figures
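
The retrieval flow described in the Figure S2 caption (embed the query, match by cosine similarity, rerank, build a reference) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the bag-of-words `embed` function stands in for a real embedding model, a plain Python list stands in for the FAISS vector database, the term-overlap `rerank` stands in for a learned reranker, and the `KNOWLEDGE` entries and prompt layout are hypothetical. It is not the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical knowledge entries; the real system indexes tool documentation
# and write-ups in a FAISS vector database.
KNOWLEDGE = [
    "nmap scans a target for open ports and services",
    "gobuster enumerates web directories with a wordlist",
    "hashcat cracks password hashes offline",
]

def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: replaces a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def rerank(query: str, candidates: list[str], min_overlap: int = 1) -> list[str]:
    """Toy reranker: drop candidates sharing too few terms, order the rest by overlap."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(c.lower().split())), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= min_overlap]
    kept.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in kept]

def build_prompt(query: str, references: list[str]) -> str:
    """Combine the reranked references with the user query (hypothetical layout)."""
    ref_block = "\n".join(f"- {r}" for r in references)
    return f"Reference:\n{ref_block}\n\nUser: {query}"

query = "which open ports does the target expose"
candidates = retrieve(query, KNOWLEDGE, k=2)
references = rerank(query, candidates)
prompt = build_prompt(query, references)
```

In this sketch the prompt would then be passed to the language model to generate the suggested next step. Separating retrieval from reranking mirrors the caption: the first stage is cheap and recall-oriented, while the second filters out candidates that matched only weakly.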