AICrypto: A Comprehensive Benchmark for Evaluating Cryptography Capabilities of Large Language Models

Yu Wang, Yijian Liu, Liheng Ji, Han Luo, Wenjie Li, Xiaofei Zhou, Chiyun Feng, Puji Wang, Yuhan Cao, Geyuan Zhang, Xiaojian Li, Rongwu Xu, Yilei Chen, Tianxing He

TL;DR

AICrypto is the first comprehensive benchmark for evaluating the cryptography capabilities of large language models (LLMs). It covers three task types: 135 multiple-choice questions (MCQs), 150 capture-the-flag (CTF) challenges, and 18 proof problems. The benchmark uses expert-curated data and an agent-based framework for CTF evaluation, and it compares 17 leading LLMs against human baselines. Results show that state-of-the-art models match or surpass humans on memorization and routine proofs but struggle with the multi-step planning and dynamic reasoning required for real-world cryptographic tasks such as CTF challenges. The work highlights both the progress and the remaining gaps in LLM cryptographic reasoning, offers a foundation for future research and practical cryptographic applications, and releases open-source resources for reproducibility.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a variety of domains. However, their applications in cryptography, which serves as a foundational pillar of cybersecurity, remain largely unexplored. To address this gap, we propose AICrypto, the first comprehensive benchmark designed to evaluate the cryptography capabilities of LLMs. The benchmark comprises 135 multiple-choice questions, 150 capture-the-flag (CTF) challenges, and 18 proof problems, covering a broad range of skills from factual memorization to vulnerability exploitation and formal reasoning. All tasks are carefully reviewed or constructed by cryptography experts to ensure correctness and rigor. To support automated evaluation of CTF challenges, we design an agent-based framework. We introduce strong human expert performance baselines for comparison across all task types. Our evaluation of 17 leading LLMs reveals that state-of-the-art models match or even surpass human experts in memorizing cryptographic concepts, exploiting common vulnerabilities, and completing routine proofs. However, our case studies reveal that they still lack a deep understanding of abstract mathematical concepts and struggle with tasks that require multi-step reasoning and dynamic analysis. We hope this work provides insights for future research on LLMs in cryptographic applications. Our code and dataset are available at https://aicryptobench.github.io/.

Paper Structure

This paper contains 98 sections, 30 equations, 26 figures, 1 table.

Figures (26)

  • Figure 1: Overview of the AICrypto benchmark.
  • Figure 2: Comparison of LLMs' performance on AICrypto. For each model (ordered left-to-right by descending composite score), MCQ accuracy (teal), CTF success rate pass@3 (orange), and average proof scoring rate (purple) are stacked to yield the composite score.
  • Figure 3: An example multiple-choice question from AICrypto.
  • Figure 4: An example CTF challenge from AICrypto. Due to space constraints, only a portion of output.txt is shown. The marker "(303 digits)" indicates that 303 digits have been omitted.
  • Figure 5: A successful challenge-solving process by o3-high. The challenge is the one shown in Figure 4. For clarity, some model outputs and formatting details are omitted. The green box indicates the model's output, while the blue box represents feedback from the environment. The model correctly identifies the RSA common-modulus vulnerability and successfully writes a script to recover the flag.
  • ...and 21 more figures
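
The Figure 5 caption mentions the RSA common-modulus vulnerability without showing the script the model wrote. As a rough illustration of the idea only, the sketch below shows the textbook attack: if the same message m is encrypted under one modulus n with two coprime public exponents e1 and e2, then Bezout coefficients a, b with a*e1 + b*e2 = 1 give m = c1^a * c2^b mod n. The function names and the toy parameters are assumptions made for this example; they are not taken from the benchmark or from the challenge in Figure 4.

```python
# Textbook RSA common-modulus attack (illustrative sketch, toy parameters).
# Requires Python 3.8+ because pow() is used with a negative exponent.


def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y


def common_modulus_attack(n, e1, c1, e2, c2):
    """Recover m from c1 = m^e1 mod n and c2 = m^e2 mod n when gcd(e1, e2) == 1."""
    g, a, b = extended_gcd(e1, e2)
    if g != 1:
        raise ValueError("e1 and e2 must be coprime for this attack")
    # One of a, b is negative; pow() then uses the modular inverse of the base,
    # which exists as long as the ciphertext is coprime to n.
    return (pow(c1, a, n) * pow(c2, b, n)) % n


if __name__ == "__main__":
    # Hypothetical toy values, far too small for real use.
    n = 61 * 53
    e1, e2 = 17, 7
    m = 65
    c1, c2 = pow(m, e1, n), pow(m, e2, n)
    recovered = common_modulus_attack(n, e1, c1, e2, c2)
    print("recovered message matches:", recovered == m)
```

Note that the attacker never factors n or learns a private key; the two ciphertexts and public exponents are enough, which is why reusing a modulus across keys is unsafe.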