Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions

Wesley Tann, Yuancheng Liu, Jun Heng Sim, Choon Meng Seah, Ee-Chien Chang

TL;DR

The paper examines whether large language models can assist or undermine cybersecurity education by solving Cisco certification questions and CTF challenges. It compares OpenAI ChatGPT, Google Bard, and Microsoft Bing, finding that ChatGPT generally performs best on factual certification questions while all models show limitations on conceptual questions; in CTF tasks, ChatGPT solves the majority of test cases, Bard fewer, and Bing the least, with jailbreaking prompts demonstrating a potential to bypass safeguards. The study highlights both the pedagogical value and integrity risks of AI-assisted learning in cybersecurity, and it suggests the need for educators to adapt curricula and safety policies accordingly. Overall, the work provides a baseline for AI-aided CTF and certification activities and motivates further research on safe, effective integration of generative AI into security education.

Abstract

The assessment of cybersecurity Capture-The-Flag (CTF) exercises involves participants finding text strings or "flags" by exploiting system vulnerabilities. Large Language Models (LLMs) are natural-language models trained on vast amounts of words to understand and generate text; they can perform well on many CTF challenges. Such LLMs are freely available to students. In the context of CTF exercises in the classroom, this raises concerns about academic integrity. Educators must understand LLMs' capabilities to modify their teaching to accommodate generative AI assistance. This research investigates the effectiveness of LLMs, particularly in the realm of CTF challenges and questions. Here we evaluate three popular LLMs: OpenAI ChatGPT, Google Bard, and Microsoft Bing. First, we assess the LLMs' question-answering performance on five Cisco certifications with varying difficulty levels. Next, we qualitatively study the LLMs' abilities in solving CTF challenges to understand their limitations. We report on the experience of using the LLMs for seven test cases in all five types of CTF challenges. In addition, we demonstrate how jailbreak prompts can bypass and break LLMs' ethical safeguards. The paper concludes by discussing LLMs' impact on CTF exercises and its implications.

Paper Structure (13 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Investigating whether large language models (e.g., OpenAI ChatGPT, Google Bard, Microsoft Bing) can aid participants in CTF test environments and in solving challenges.
  • Figure 2: The network topology in a small web server environment for the Web Security CTF challenge: Shell Shock Attack.
  • Figure 3: AIM using creative prompts to trick ChatGPT into bypassing its safety policy and providing information about security exploits against a target server.