Towards Supporting Penetration Testing Education with Large Language Models: an Evaluation and Comparison

Martin Nizon-Deladoeuille, Brynjólfur Stefánsson, Helmut Neukirchen, Thomas Welsh

TL;DR

The paper addresses how to scale penetration testing education using large language models (LLMs) by conducting a comparative evaluation of six LLMs across 15 pentesting tasks on Metasploitable v3 and OWASP WebGoat. It systematically measures performance with a multi-criterion rubric and expert validation, finding that GPT-4o mini and GPT-4o are the most reliable for educational settings, with WhiteRabbitNeo offering complementary domain-specific capabilities. The results guide educators in selecting LLMs for classroom use and highlight the trade-offs between general-purpose and domain-specific models. Limitations include a small model set and single-response evaluation, motivating further studies with larger cohorts and varied prompting strategies.

Abstract

Cybersecurity education is challenging, and it is helpful for educators to understand the capabilities of Large Language Models (LLMs) for supporting education. This study evaluates the effectiveness of LLMs in conducting a variety of penetration testing tasks. Fifteen representative tasks were selected to cover a comprehensive range of real-world scenarios. We evaluate the performance of six models (GPT-4o mini, GPT-4o, Gemini 1.5 Flash, Llama 3.1 405B, Mixtral 8x7B and WhiteRabbitNeo) on the Metasploitable v3 Ubuntu image and OWASP WebGoat. Our findings suggest that GPT-4o mini currently offers the most consistent support, making it a valuable tool for educational purposes. However, its use in conjunction with WhiteRabbitNeo should be considered because of the latter's innovative approach to tool and command recommendations. This study underscores the need for continued research into optimising LLMs for complex, domain-specific tasks in cybersecurity education.

Paper Structure

This paper contains 5 sections, 2 tables.