CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval

Ye Liu, Rui Meng, Shafiq Joty, Silvio Savarese, Caiming Xiong, Yingbo Zhou, Semih Yavuz

TL;DR

CodeXEmbed tackles the challenge of code retrieval by introducing a generalist embedding model family that unifies code and text retrieval across multiple languages and tasks. It employs a two-stage LoRA-based training regime within a unified query-document framework, enhanced by GradCache for scalable training. The approach achieves state-of-the-art results on code retrieval benchmarks and maintains competitive text-retrieval performance, while enabling improved end-to-end retrieval-augmented code generation. The work demonstrates strong cross-domain transferability and provides practical tools for developer-oriented retrieval and coding tasks.

Abstract

Despite the success of text retrieval in many NLP tasks, code retrieval remains a largely underexplored area. Most text retrieval systems are tailored for natural language queries, often neglecting the specific challenges of retrieving code. This gap leaves existing models unable to effectively capture the diversity of programming languages and tasks across different domains, highlighting the need for more focused research in code retrieval. To address this, we introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters. Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework, enhancing model generalizability and retrieval performance. Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark. In addition to excelling in code retrieval, our models demonstrate competitive performance on the widely adopted BeIR text retrieval benchmark, offering versatility across domains. Experimental results demonstrate that improving retrieval performance significantly enhances end-to-end Retrieval-Augmented Generation (RAG) performance for code-related tasks.
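To make the "common retrieval framework" idea concrete, here is a minimal sketch of how different code tasks might be cast into one (query, document) shape and served by a single retrieval routine. Everything in it is an illustrative assumption rather than the authors' implementation: the `RetrievalPair` class, the `to_pair`, `embed`, and `retrieve` helpers, and the example pairs are hypothetical, and the bag-of-tokens `embed` function is only a toy stand-in for a learned embedding model such as CodeXEmbed.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RetrievalPair:
    """Hypothetical container: every task becomes one query and one document."""
    category: str  # e.g. "Text-to-Code", "Code-to-Text", "Code-to-Code"
    query: str
    document: str


def to_pair(category: str, source: str, target: str) -> RetrievalPair:
    """Cast any task into the same (query, document) shape."""
    return RetrievalPair(category=category, query=source, document=target)


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: counts of lowercase word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: List[str]) -> str:
    """Return the document most similar to the query."""
    q = embed(query)
    return max(documents, key=lambda d: cosine(q, embed(d)))


# Made-up examples: text and code can appear on either side of a pair.
pairs = [
    to_pair("Text-to-Code", "reverse a string",
            "def reverse_string(s): return s[::-1]"),
    to_pair("Code-to-Text", "def add(a, b): return a + b",
            "add two numbers and return the sum"),
    to_pair("Code-to-Code", "def area(r): return 3.14 * r * r",
            "def circle_area(radius): return math.pi * radius ** 2"),
]

corpus = [p.document for p in pairs]
best = retrieve(pairs[0].query, corpus)
print(best)
```

Because every task shares one shape, the same `retrieve` call serves natural-language and code queries alike. In a real system the toy `embed` would be replaced by the trained model's encoder.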

Paper Structure

This paper contains 30 sections, 7 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: The illustration depicts multi-stage training, showing the data used at each stage, the LoRA adapters applied, and the continuous optimizer with its learning rate progression.
  • Figure 2: The code training data of CodeXEmbed contains four parts: Text-to-Code, Code-to-Code, Code-to-Text, and Hybrid Code. Each category contains several types of code tasks.
  • Figure 3: The performance comparison between General Training (GT) and In-domain Training (ID) across three model sizes (400M, 2B, and 7B) on different CoIR categories and the overall average.
  • Figure 4: The programming language distribution of code training data in the general training stage.