Table of Contents
Fetching ...

CryptoFormalEval: Integrating LLMs and Formal Verification for Automated Cryptographic Protocol Vulnerability Detection

Cristian Curaba, Denis D'Ambrosi, Alessandro Minisini, Natalia Pérez-Campanero Antolín

TL;DR

A benchmark to assess the ability of Large Language Models to autonomously identify vulnerabilities in new cryptographic protocols through interaction with Tamarin: a theorem prover for protocol verification is introduced.

Abstract

Cryptographic protocols play a fundamental role in securing modern digital infrastructure, but they are often deployed without prior formal verification. This could lead to the adoption of distributed systems vulnerable to attack vectors. Formal verification methods, on the other hand, require complex and time-consuming techniques that lack automatization. In this paper, we introduce a benchmark to assess the ability of Large Language Models (LLMs) to autonomously identify vulnerabilities in new cryptographic protocols through interaction with Tamarin: a theorem prover for protocol verification. We created a manually validated dataset of novel, flawed, communication protocols and designed a method to automatically verify the vulnerabilities found by the AI agents. Our results about the performances of the current frontier models on the benchmark provides insights about the possibility of cybersecurity applications by integrating LLMs with symbolic reasoning systems.

CryptoFormalEval: Integrating LLMs and Formal Verification for Automated Cryptographic Protocol Vulnerability Detection

TL;DR

A benchmark to assess the ability of Large Language Models to autonomously identify vulnerabilities in new cryptographic protocols through interaction with Tamarin: a theorem prover for protocol verification is introduced.

Abstract

Cryptographic protocols play a fundamental role in securing modern digital infrastructure, but they are often deployed without prior formal verification. This could lead to the adoption of distributed systems vulnerable to attack vectors. Formal verification methods, on the other hand, require complex and time-consuming techniques that lack automatization. In this paper, we introduce a benchmark to assess the ability of Large Language Models (LLMs) to autonomously identify vulnerabilities in new cryptographic protocols through interaction with Tamarin: a theorem prover for protocol verification. We created a manually validated dataset of novel, flawed, communication protocols and designed a method to automatically verify the vulnerabilities found by the AI agents. Our results about the performances of the current frontier models on the benchmark provides insights about the possibility of cybersecurity applications by integrating LLMs with symbolic reasoning systems.

Paper Structure

This paper contains 33 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of the benchmark's structure. The AI agent must identify a vulnerability in a novel protocol within a predetermined number of API calls by interacting with the Tamarin prover and iteratively adapting to its feedback until an attack is found.
  • Figure 2: Comparative performance evaluation of different frontier LLMs across five security protocol verification tasks (detailed in Appendix \ref{['dataset']}). Performance ratings: 1 - Major difficulties with instruction following and frequent syntax errors; 2 - Basic Tamarin code generation with adaptation to feedback, but presence of trivial semantic errors; 3 - Production of syntactically valid Tamarin code with conceptual mistakes; 4 - Successful verification task. See Appendix \ref{['app:examples']} for example errors.

Theorems & Definitions (3)

  • Example 1: Sending To Network Pre-Shared Symmetric Key
  • Example 2: Imposing Structure in Input Messages
  • Example 3: Bad Observable Placement