Large Language Models (LLMs) for Source Code Analysis: applications, models and datasets

Hamed Jelodar, Mohammad Meymani, Roozbeh Razavi-Far

TL;DR

This paper provides a systematic survey of how large language models and transformers are applied to source code analysis, spanning tasks from code understanding and summarization to disassembly, decompiling, generation, and security analysis. It synthesizes widely used models (CodeBERT, CodeT5, GPT-family, DeepSeek, Qwen) and domain-adaptive pre-training strategies, and catalogs public datasets (CodeSearchNet, CodeNet, CodeXGLUE, The Stack) while highlighting critical challenges such as long-code handling, dataset limitations, and security biases. The authors contribute a taxonomy of code-analysis tasks, a comparative view of prominent models, a timeline of code-focused datasets, and a discussion of limitations and future directions, aiming to guide researchers and practitioners in selecting models and datasets for reliable, scalable code analysis. The work emphasizes the integration of LLMs with traditional analysis methods and security frameworks to improve efficiency, correctness, and documentation in software development workflows. Overall, the paper informs the design of robust, domain-aware LLM systems for code analytics and outlines concrete avenues for advancing dataset quality, model capabilities, and evaluation benchmarks.

Abstract

Large language models (LLMs) and transformer-based architectures are increasingly used for source code analysis. As software systems grow in complexity, integrating LLMs into code analysis workflows becomes essential for enhancing efficiency, accuracy, and automation. This paper explores the role of LLMs in different code analysis tasks, focusing on three key aspects: 1) what they can analyze and their applications, 2) what models are used, and 3) what datasets are used, along with the challenges they face. To this end, we investigate scholarly articles that explore the use of LLMs for source code analysis to uncover research developments, current trends, and the intellectual structure of this emerging field. Additionally, we summarize limitations and highlight essential tools, datasets, and key challenges, which could be valuable for future work.

Paper Structure

This paper contains 33 sections, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: Binary-code summary using an LLM
  • Figure 2: The process of code decompiling in different tasks using Ghidra
  • Figure 3: Application of LLMs to different source code analysis tasks
  • Figure 4: A general view of the pre-training of an LLM based on masked language modeling
  • Figure 5: A taxonomy of recent NLP and LLM models for source code analysis
  • ...and 2 more figures