Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code

Zhou Yang, Zhensu Sun, Terry Zhuo Yue, Premkumar Devanbu, David Lo

TL;DR

This systematic review covers 146 studies on LLMs for code and identifies seven properties beyond accuracy, including robustness, security, privacy, explainability, efficiency, and usability, detailing how each is evaluated and enhanced. It maps current methodologies, highlights gaps (notably in scalable defense, generation-task explainability, and usability across diverse tasks), and proposes data-, human-, and system-centric perspectives to guide future research. The work offers a structured blueprint for making code-focused language models more trustworthy, efficient, and usable, and suggests directions for benchmarks and practical tooling. Together, these insights support safer deployment and broader adoption of LLM4Code in real-world software engineering workflows.

Abstract

Large language models for code (LLM4Code), which demonstrate strong performance (e.g., high accuracy) in processing source code, have significantly transformed software engineering. Many studies separately investigate the non-functional properties of LLM4Code, but there is no systematic review of how these properties are evaluated and enhanced. This paper fills this gap by thoroughly examining 146 relevant studies, thereby presenting the first systematic literature review to identify seven important properties beyond accuracy, including robustness, security, privacy, explainability, efficiency, and usability. We discuss the current state-of-the-art methods and trends, identify gaps in existing research, and present promising directions for future study.

Paper Structure (41 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Panel (a) shows the cumulative number of papers relevant to the topic of this survey over the past six years. The data collection date is 20 Feb 2024. Panel (b) shows the distribution of papers across different properties, where the 'Other' category includes papers that are surveys and discussions.
  • Figure 2: Panel (a) compares, by property, the distribution of papers evaluating LLM4Code and Non-LLM4Code. Panel (b) shows the distribution of papers across publication venues, where the 'Other' category includes venues with fewer than 2 relevant papers.