Automated Creation of Source Code Variants of a Cryptographic Hash Function Implementation Using Generative Pre-Trained Transformer Models

Elijah Pelofske, Vincent Urias, Lorie M. Liebrock

TL;DR

The paper investigates the feasibility and risks of using open-source Generative Pre-trained Transformer (GPT) models to automatically rewrite a C implementation of the cryptographic hash function SHA-1. It uses retrieval-augmented generation grounded in the full reference code across three GPT models, generating thousands of function rewrites for each of the four SHA-1 component functions and evaluating them with a multi-stage testbed that includes compilation across multiple compilers and optimization levels, as well as memory safety checks. A set of metrics quantifies compilability, correctness across test vectors, and algorithmic integrity, revealing both the potential for correct, functionally equivalent rewrites and the prevalence of insecure or unstable variants. The findings highlight substantial security concerns and the need for rigorous testing when deploying GPT-based code generation in security-sensitive domains, while also demonstrating the utility of large-scale GPT code variation for research on code generation and malware detection signatures.

Abstract

Generative pre-trained transformers (GPTs) are a type of large language machine learning model that is unusually adept at producing novel and coherent natural language. In this study, the ability of GPT models to generate novel and correct versions, and notably very insecure versions, of implementations of the cryptographic hash function SHA-1 is examined. The GPT models Llama-2-70b-chat-hf, Mistral-7B-Instruct-v0.1, and zephyr-7b-alpha are used. The GPT models are prompted to re-write each function using a modified version of the localGPT framework and langchain to provide word embedding context of the full source code and header files to the model, resulting in over 150,000 function re-write GPT output text blocks, approximately 50,000 of which could be parsed as C code and subsequently compiled. The generated code is analyzed for compilability, algorithmic correctness, memory leaks, compiler optimization stability, and character distance to the reference implementation. Remarkably, several generated function variants carry a high implementation security risk: they are correct for some test vectors but incorrect for others. Additionally, many function implementations did not match the reference SHA-1 algorithm but produced hashes that have some of the basic characteristics of hash functions. Many of the function re-writes contained serious flaws such as memory leaks, integer overflows, out-of-bounds accesses, use of uninitialised values, and compiler optimization instability. Compiler optimization settings and SHA-256 checksums of the compiled binaries are used to cluster implementations that are equivalent but may not have identical syntax. Using this clustering, over 100,000 novel and correct versions of the SHA-1 codebase were generated, in which each component C function differs from the original code.
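
The partial-correctness risk described in the abstract (correct for some test vectors, incorrect for others) can be illustrated with a small sketch. This is not the authors' testbed: the function `classify`, the three category names, and the test vectors below are hypothetical, and a Python callable stands in for running a compiled C binary. Python's `hashlib.sha1` serves as the reference.

```python
import hashlib
from typing import Callable, Sequence

# Hypothetical test vectors; the paper's actual vectors are not listed here.
TEST_VECTORS = (b"", b"abc", b"a" * 1000, bytes(range(256)))


def classify(candidate: Callable[[bytes], str],
             vectors: Sequence[bytes] = TEST_VECTORS) -> str:
    """Compare a candidate SHA-1 implementation against hashlib.sha1.

    Returns "correct" if every vector matches, "partially_correct" if only
    some match, and "incorrect" if none match.
    """
    results = []
    for message in vectors:
        expected = hashlib.sha1(message).hexdigest()
        try:
            results.append(candidate(message) == expected)
        except Exception:
            # A crash on a vector is counted as a failure for that vector.
            results.append(False)
    if all(results):
        return "correct"
    if any(results):
        return "partially_correct"
    return "incorrect"


# Stand-ins for compiled rewrites (assumptions for illustration only).
def reference(message: bytes) -> str:
    return hashlib.sha1(message).hexdigest()


def truncating(message: bytes) -> str:
    # Silently ignores everything past 55 bytes, so only short inputs match.
    return hashlib.sha1(message[:55]).hexdigest()


def unrelated(message: bytes) -> str:
    # Produces a hash-like string that never equals the SHA-1 digest.
    return hashlib.md5(message).hexdigest()
```

The `truncating` stand-in shows why a small set of short test vectors is not enough: it passes every input of 55 bytes or fewer and fails only on longer messages.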

Paper Structure (16 sections, 5 figures, 1 table)

This paper contains 16 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Correct code rewrite performance metrics as a function of inference temperature.
  • Figure 2: Correct code rewrite performance metrics as a function of the $10$ different prompts.
  • Figure 3: Correct code rewrite performance metrics as a function of the three GPT models.
  • Figure 4: Visualized examples of compiled SHA-1 binaries using binocle. In order from top left to right: one function re-write that is fully correct, two function re-writes for which all test cases fail (and the outputs are not close to the correct SHA-1 hashes), and one function re-write for which some test vectors produced a correct SHA-1 hash but at least one test vector failed (and which was not compiler optimization unstable). All of these example binaries were compiled using clang with an optimization level of $0$. These binaries were arbitrarily selected as representative examples.
  • Figure 5: Graph renderings of various example connected components from the compiled binary clustering procedure, specifically for the GPT function re-writes of the SHA-1 C code where the compiled binary correctly produced SHA-1 hashes (and did not have fatal errors or compiler optimization instability). Each node (blue) represents a tuple of a single SHA-1 component function re-write, whose source code had a Levenshtein character distance greater than $0$ from the original source code (after repeated whitespace and code comments were removed), and one of the $13$ compiler optimization settings used (either gcc or clang, with varying optimization levels). In other words, each node represents a single compiled binary that correctly executed the SHA-1 algorithm. Each edge indicates that the SHA-256 checksums of the two compiled binaries it connects are equal. These networks are not the comprehensive clustering of the correct SHA-1 re-writes, but they do represent a majority of the graphs that were produced. Notably, the four graphs that are noticeably larger and more densely connected than the others correspond to function re-writes that are equivalent to the original source code because the syntax changes made by the GPT models were relatively minimal. Each of these graphs is a single connected component from the overall binary hash clustering procedure, which is described in the methods section on GPT output parsing and testing; its summary statistics are shown in Table 1 (aggregate SHA-1 metrics).
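
The clustering described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the regex-based comment stripping (which ignores comment markers inside string literals), and the node labels are all hypothetical. Source code is normalized before the Levenshtein distance is computed, and compiled binaries are grouped by their SHA-256 checksum.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Node = Tuple[str, str]  # (function re-write id, compiler setting)


def normalize(source: str) -> str:
    """Remove C comments and collapse repeated whitespace (simplified)."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)                    # line comments
    return re.sub(r"\s+", " ", source).strip()


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution
            ))
        previous = current
    return previous[-1]


def cluster_by_checksum(binaries: Dict[Node, bytes]) -> List[Set[Node]]:
    """Group nodes whose compiled bytes share the same SHA-256 checksum."""
    groups: Dict[str, Set[Node]] = defaultdict(set)
    for node, blob in binaries.items():
        groups[hashlib.sha256(blob).hexdigest()].add(node)
    return list(groups.values())


# Placeholder byte strings stand in for real compiled binaries.
example_binaries = {
    ("rewrite_a", "gcc -O0"): b"\x01",
    ("rewrite_b", "gcc -O0"): b"\x01",
    ("rewrite_a", "clang -O3"): b"\x02",
}
example_clusters = cluster_by_checksum(example_binaries)
```

Grouping by checksum directly is used here because checksum equality is transitive; a re-write would only be considered novel when `levenshtein(normalize(rewrite), normalize(original))` is greater than zero.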