SoK: Understanding (New) Security Issues Across AI4Code Use Cases

Qilong Wu, Taoran Li, Tianyang Zhou, Varun Chandrasekaran

TL;DR

This SoK analyzes security across AI4Code use cases—code generation, vulnerability detection, and code translation—highlighting systemic gaps like Python monocultures, insecure outputs, and weak robustness. It synthesizes a broad experimental program examining misalignment, vulnerability reproduction, and translation effects, revealing that higher functional performance often coexists with weaker security and that robustness requires evaluation beyond standard accuracy. The work proposes security-by-default practices, robust benchmarks, and translation-based security refactoring as pathways to safer AI4Code deployment, and outlines 11 future directions to embed security throughout lifecycle workflows. Collectively, it reframes AI4Code development as security-first engineering, emphasizing adversarial resilience, privacy safeguards, and trustworthy governance across tools and pipelines.

Abstract

AI-for-Code (AI4Code) systems are reshaping software engineering, with tools like GitHub Copilot accelerating code generation, translation, and vulnerability detection. Alongside these advances, however, security risks remain pervasive: insecure outputs, biased benchmarks, and susceptibility to adversarial manipulation undermine their reliability. This SoK surveys the landscape of AI4Code security across three core applications, identifying recurring gaps: benchmark dominance by Python and toy problems, lack of standardized security datasets, data leakage in evaluation, and fragile adversarial robustness. A comparative study of six state-of-the-art models illustrates these challenges: insecure patterns persist in code generation, vulnerability detection is brittle to semantic-preserving attacks, fine-tuning often misaligns security objectives, and code translation yields uneven security benefits. From this analysis, we distill three forward paths: embedding secure-by-default practices in code generation, building robust and comprehensive detection benchmarks, and leveraging translation as a route to security-enhanced languages. We call for a shift toward security-first AI4Code, where vulnerability mitigation and robustness are embedded throughout the development life cycle.

Paper Structure

This paper contains 35 sections, 19 figures, 21 tables.

Figures (19)

  • Figure 1: Publication trends in AI4Code security research (2018–2025) by method type and task domain.
  • Figure 2: CWE-specific detection for C. Resource Management Errors (91%) are easiest, Pointer Issues (79%) hardest. Trends hold across models.
  • Figure 3: Vulnerability rates of translated code. Ground Truth represents dataset labels, while LLM baseline represents Claude4's detection results on the original untranslated code.
  • Figure 4: Distribution of CWE types by programming language across selected tasks. Q$_1$ and Q$_3$ distributions are shown in Figure \ref{fig:cwe_by_language_Q1_Q3} (Appendix \ref{app:cwe_by_language_additional}). Models appear left-to-right: Claude4, Gemini, GPT-4o, Llama4, o3, and Qwen3.
  • Figure 5: Unified robustness analysis across prompting strategies. The y-axis shows relative BLEU drop (%) from clean to attacked code; lower is better.
  • ...and 14 more figures
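
The "relative BLEU drop (%)" on the y-axis of Figure 5 can be read as the loss in BLEU under attack, expressed as a percentage of the clean score. The sketch below assumes that conventional definition; the paper's exact formula is not given in this section, and the function name `relative_drop` and the sample scores are hypothetical, used only for illustration.

```python
def relative_drop(clean_score: float, attacked_score: float) -> float:
    """Percentage drop from the clean score to the attacked score.

    Assumed definition (not confirmed by the source):
        100 * (clean - attacked) / clean
    Lower values mean the model is more robust to the attack.
    """
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return 100.0 * (clean_score - attacked_score) / clean_score


if __name__ == "__main__":
    # Hypothetical BLEU scores, not taken from the paper.
    print(relative_drop(40.0, 30.0))
```

Normalizing by the clean score makes the drop comparable across prompting strategies whose clean BLEU differs, which is presumably why the figure reports a relative rather than an absolute drop.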