Refining Joint Text and Source Code Embeddings for Retrieval Task with Parameter-Efficient Fine-Tuning

Karim Galliamov, Leila Khaertdinova, Karina Denisova

TL;DR

This work addresses efficient adaptation of bimodal code-text retrieval models under resource constraints by applying Parameter-Efficient Fine-Tuning (PEFT) combined with contrastive learning to CodeT5+. The authors benchmark several PEFT methods (LoRA, AdaLoRA, IA3, Prompt-Tuning) on CodeSearchNet and a custom dataset, showing that updating at most about $0.4\%$ of the parameters is enough to improve retrieval quality. They also integrate the tuned embeddings into a Retrieval-Augmented Generation (RAG) pipeline, achieving modest ROUGE gains in code generation, and release open-source checkpoints. The study demonstrates the practicality of PEFT for code retrieval and offers a reusable framework for systematic benchmarking of PEFT methods in bimodal tasks.

Abstract

Recent developments in Natural Language Processing (NLP) have brought remarkable progress on the code-text retrieval problem. As the Transformer-based models used in this task continue to grow in size, the computational cost and time required for end-to-end fine-tuning become substantial. This poses a significant challenge for adapting and using these models when computational resources are limited. Motivated by these concerns, we propose a fine-tuning framework that leverages Parameter-Efficient Fine-Tuning (PEFT) techniques. Moreover, we adopt contrastive learning objectives to improve the quality of the bimodal representations learned by Transformer models. Additionally, we provide extensive benchmarking of PEFT methods, the lack of which has been highlighted as a crucial problem in the literature. Based on thorough experimentation with the CodeT5+ model on two datasets, we demonstrate that the proposed fine-tuning framework has the potential to improve code-text retrieval performance while tuning at most 0.4% of the parameters.
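
To make the 0.4% figure more concrete, the following rough sketch estimates the trainable share for LoRA, one of the benchmarked PEFT methods. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors with r * (d_in + d_out) parameters in total. The layer count, hidden size, rank, base-model size, and the choice of adapting two projection matrices per layer are illustrative assumptions, not the CodeT5+ configuration used in the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(base_params: int, num_layers: int,
                       matrices_per_layer: int, d_model: int, rank: int) -> float:
    """Share of trainable parameters when only the LoRA factors are updated."""
    added = num_layers * matrices_per_layer * lora_trainable_params(d_model, d_model, rank)
    return added / (base_params + added)


if __name__ == "__main__":
    # Hypothetical encoder: 12 layers, hidden size 768, about 110M frozen parameters,
    # LoRA of rank 8 applied to two square projection matrices per layer.
    fraction = trainable_fraction(
        base_params=110_000_000,
        num_layers=12,
        matrices_per_layer=2,
        d_model=768,
        rank=8,
    )
    print(f"trainable share: {fraction:.4%}")
```

Under these assumed numbers the trainable share stays well below 1%, which is the same order of magnitude as the figure reported in the abstract.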

Paper Structure (21 sections, 2 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: The proposed fine-tuning framework. The contrastive loss aims to maximize the similarity between corresponding code-text pairs and minimize the similarity between non-matching pairs. For visual clarity, this is shown schematically for one positive pair in a batch, namely the "hello world" text and its Java code. During fine-tuning, CodeT5+ is tuned using PEFT techniques.
  • Figure 2: Distribution of token lengths for natural-language (NL) docstrings and programming-language (PL) code snippets, for the PLs included in our dataset.
  • Figure 3: Integration of the best checkpoints of our fine-tuned models into the RAG pipeline for the different PLs used in the study. The figure shows an example of code generation for C++.
  • Figure 4: Validation loss plots for the embedding model on the PoC datasets.
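
The contrastive objective described in the Figure 1 caption can be sketched as follows. This is a minimal illustration that assumes a symmetric InfoNCE-style loss over cosine similarities with a temperature of 0.07; the exact loss, temperature, and batching used in the paper may differ. Row i of the text embeddings and row i of the code embeddings are treated as the positive pair, and every other combination in the batch serves as a negative.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_logits(text_embs, code_embs, temperature):
    """Temperature-scaled cosine similarity between every text and every code snippet."""
    texts = [l2_normalize(t) for t in text_embs]
    codes = [l2_normalize(c) for c in code_embs]
    return [[sum(a * b for a, b in zip(t, c)) / temperature for c in codes]
            for t in texts]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(text_embs, code_embs, temperature=0.07):
    """Average of the text-to-code and code-to-text losses over the batch."""
    logits = similarity_logits(text_embs, code_embs, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, purely illustrative.
    texts = [[1.0, 0.0], [0.0, 1.0]]
    matching_code = [[1.0, 0.0], [0.0, 1.0]]
    swapped_code = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned pairs:", contrastive_loss(texts, matching_code))
    print("swapped pairs:", contrastive_loss(texts, swapped_code))
```

The loss is close to zero when each text is most similar to its own code snippet and grows when the pairs are swapped, which is the behaviour the caption describes.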