Table of Contents
Fetching ...

TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code

Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic

TL;DR

This work tackles vulnerabilities in LLM-generated code by leveraging the Scala type system through an agentic prompting framework called TypePilot. By structuring multi-step interactions that first generate code, then detect vulnerabilities, and finally refine the code with type-guided safeguards, the approach aims to convert typing guarantees into active safety properties for generated software. Empirical results show limited success with formal verification in Stainless, but clear improvements in general Scala code generation, particularly in reducing input-constraint violations and code injection across multiple models, with the strongest gains from certain configurations like Qwen-2.5-Coder (32B). The contributions demonstrate that type-system-driven, agented workflows can significantly enhance trustworthiness in automated code generation for high-assurance domains and offer a practical path toward safer AI-assisted software engineering, albeit with scalability and verification limitations to address in future work.

Abstract

Large language Models (LLMs) have shown remarkable proficiency in code generation tasks across various programming languages. However, their outputs often contain subtle but critical vulnerabilities, posing significant risks when deployed in security-sensitive or mission-critical systems. This paper introduces TypePilot, an agentic AI framework designed to enhance the security and robustness of LLM-generated code by leveraging strongly typed and verifiable languages, using Scala as a representative example. We evaluate the effectiveness of our approach in two settings: formal verification with the Stainless framework and general-purpose secure code generation. Our experiments with leading open-source LLMs reveal that while direct code generation often fails to enforce safety constraints, just as naive prompting for more secure code, our type-focused agentic pipeline substantially mitigates input validation and injection vulnerabilities. The results demonstrate the potential of structured, type-guided LLM workflows to improve the SotA of the trustworthiness of automated code generation in high-assurance domains.

TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code

TL;DR

This work tackles vulnerabilities in LLM-generated code by leveraging the Scala type system through an agentic prompting framework called TypePilot. By structuring multi-step interactions that first generate code, then detect vulnerabilities, and finally refine the code with type-guided safeguards, the approach aims to convert typing guarantees into active safety properties for generated software. Empirical results show limited success with formal verification in Stainless, but clear improvements in general Scala code generation, particularly in reducing input-constraint violations and code injection across multiple models, with the strongest gains from certain configurations like Qwen-2.5-Coder (32B). The contributions demonstrate that type-system-driven, agented workflows can significantly enhance trustworthiness in automated code generation for high-assurance domains and offer a practical path toward safer AI-assisted software engineering, albeit with scalability and verification limitations to address in future work.

Abstract

Large language Models (LLMs) have shown remarkable proficiency in code generation tasks across various programming languages. However, their outputs often contain subtle but critical vulnerabilities, posing significant risks when deployed in security-sensitive or mission-critical systems. This paper introduces TypePilot, an agentic AI framework designed to enhance the security and robustness of LLM-generated code by leveraging strongly typed and verifiable languages, using Scala as a representative example. We evaluate the effectiveness of our approach in two settings: formal verification with the Stainless framework and general-purpose secure code generation. Our experiments with leading open-source LLMs reveal that while direct code generation often fails to enforce safety constraints, just as naive prompting for more secure code, our type-focused agentic pipeline substantially mitigates input validation and injection vulnerabilities. The results demonstrate the potential of structured, type-guided LLM workflows to improve the SotA of the trustworthiness of automated code generation in high-assurance domains.

Paper Structure

This paper contains 18 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: The full pipeline for the generation of code using TypePilot. After the initial generation of the code, the vulnerabilities are detected by a separate instance of the LLM. Then, a final LLM is prompted to leverage the Scala type system to improve the initial code, given the detected vulnerabilities.
  • Figure 2: Prompt used to generate the Stainless code.
  • Figure 3: Prompt used to generate the code in the robust prompting setting.
  • Figure 4: Prompts used to generate the initial code, the vulnerabilities and the final code in TypePilot. The final prompt guides the LLM to use the Scala type system to make the code more robust.
  • Figure 5: Comparison of baseline and TypePilot average age function generations from Qwen-2.5-Coder (32B)
  • ...and 5 more figures