Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches

Yicheng Tao, Yao Qin, Yepang Liu

TL;DR

Repository-scale software development demands long-range reasoning across multiple files; traditional function- or file-level generation is insufficient. The paper surveys Retrieval-Augmented Code Generation (RACG) with emphasis on repository-level techniques, presenting a unified taxonomy of retrieval strategies, architectures, training paradigms, and benchmarks. It synthesizes datasets, identifies limitations, and outlines opportunities such as multimodal context, memory-efficient designs, and tighter retrieval-generation coupling. The analysis aims to guide researchers and practitioners toward deployable, scalable AI-assisted software engineering.

Abstract

Recent advancements in large language models (LLMs) have substantially improved automated code generation. While function-level and file-level generation have achieved promising results, real-world software development typically requires reasoning across entire repositories. This gives rise to the challenging task of Repository-Level Code Generation (RLCG), where models must capture long-range dependencies, ensure global semantic consistency, and generate coherent code spanning multiple files or modules. To address these challenges, Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm that integrates external retrieval mechanisms with LLMs, enhancing context-awareness and scalability. In this survey, we provide a comprehensive review of research on Retrieval-Augmented Code Generation (RACG), with an emphasis on repository-level approaches. We categorize existing work along several dimensions, including generation strategies, retrieval modalities, model architectures, training paradigms, and evaluation protocols. Furthermore, we summarize widely used datasets and benchmarks, analyze current limitations, and outline key challenges and opportunities for future research. Our goal is to establish a unified analytical framework for understanding this rapidly evolving field and to inspire continued progress in AI-powered software engineering.

Paper Structure

This paper contains 50 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Comparison between General Code Generation and Repository-Level Code Generation
  • Figure 2: Retrieval-Augmented Code Generation, with a focus on Repository-Level approaches
  • Figure 3: Selected Paper Distribution
  • Figure 4: Top contributing universities and companies in RACG-related research.
  • Figure 5: Mapping between research questions and corresponding survey sections. Each flow indicates the relationship between a specific RQ and the sections that address it, providing a clear roadmap for readers to navigate the survey content.
  • ...and 5 more figures