Selective Shot Learning for Code Explanation

Paheli Bhattacharya, Rishabh Gupta

TL;DR

This work tackles code explanation via few-shot prompting of open-source Code-LLMs and identifies Selective Shot Learning (SSL), the deliberate selection of few-shot demonstrations, as a key lever. It introduces SSL_ner, a token/embedding-free approach that uses code entity information to select demonstrations, and benchmarks open-source Code-LLMs on two datasets (CoNaLa, inline Python; TLC, function-level Java). Empirically, SSL_ner often yields the best token-based demonstrations. The results also show that medium-sized LLMs benefit more from few-shot prompting, while CodeLlama 34B excels in zero-shot settings. The study provides a principled, interpretable, and extensible framework for few-shot prompt design and establishes a first systematic benchmark of open-source Code-LLMs for code explanation.

Abstract

Code explanation plays a crucial role in the software engineering domain, helping developers grasp code functionality efficiently. Recent work shows that the performance of LLMs on code explanation improves in a few-shot setting, especially when the few-shot examples are selected intelligently. State-of-the-art approaches for such Selective Shot Learning (SSL) include token-based and embedding-based methods. However, these SSL approaches have been evaluated on proprietary LLMs, with little exploration of open-source Code-LLMs. Additionally, these methods do not take programming language syntax into account. To bridge these gaps, we present a comparative study and propose a novel SSL method (SSL_ner) that uses entity information for few-shot example selection. We present several insights and show the effectiveness of the SSL_ner approach over state-of-the-art methods across two datasets. To the best of our knowledge, this is the first systematic benchmarking of open-source Code-LLMs that also assesses the performance of the various few-shot example selection approaches for the code explanation task.

Paper Structure

This paper contains 15 sections, 1 equation, 5 figures, 9 tables.

Figures (5)

  • Figure 1: The workflow of the code explanation pipeline using Selective Shot Learning (SSL) approaches. In the input we have a query code snippet $q$ whose explanation needs to be generated and a training database containing $(code\ snippet, code\ explanation)$ pairs from which the few-shot examples need to be selected. The training data samples are ranked according to their similarity with $q$, where similarity can be computed using either $Selection_{token}$, $Selection_{semantic}$ or $SSL_{ner}$. From the ranked list, top-k examples are selected and given as a prompt along with $q$ to an LLM which then generates the explanation.
  • Figure 2: An example demonstrating the Query Code method, the top 1 demonstration example selected by $Selection_{token}$, $Selection_{semantic}$ and $SSL_{ner}$ along with the LLM (StarCoder) generated output for each method, respectively.
  • Figure 3: An example demonstrating the Query Code method, the top 1 demonstration example selected by $Selection_{token}$, $Selection_{semantic}$ and $SSL_{ner}$ along with the LLM (StarCoder) generated output for each method, respectively. Due to the lengthy function-level codes and page limitation, we omit portions of the selected codes in the middle.
  • Figure 4: Figure demonstrating a query code sample, the top 3 examples selected by $Selection_{token}$ and the explanation generated for the query code sample using StarCoder. Due to the lengthy function-level codes and page limitation, we omit portions of the selected codes in the middle.
  • Figure 5: Figure demonstrating a query code sample and the top 3 examples selected by $SSL_{ner}$ and the explanation generated for the query code sample using StarCoder.
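
The workflow described in the Figure 1 caption (rank the training pairs by similarity to the query $q$, keep the top-k, and pass them with $q$ as a prompt) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `token_jaccard`, `select_top_k`, and `build_prompt`, the Jaccard token-overlap score, and the prompt template are all hypothetical choices made for this sketch. The similarity function is a plug-in point, so an embedding-based or entity-based scorer could be substituted for the token-based one shown here.

```python
import re
from typing import Callable, List, Tuple

# A training example is a (code snippet, code explanation) pair.
Example = Tuple[str, str]


def token_jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two code strings.

    Assumption: Jaccard overlap of identifier/number tokens stands in for a
    token-based selection score; the source does not specify the exact metric.
    """
    tokens_a = set(re.findall(r"[A-Za-z_]\w*|\d+", a))
    tokens_b = set(re.findall(r"[A-Za-z_]\w*|\d+", b))
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_top_k(
    query: str,
    train: List[Example],
    k: int,
    similarity: Callable[[str, str], float] = token_jaccard,
) -> List[Example]:
    """Rank training pairs by similarity of their code to the query; keep k."""
    ranked = sorted(train, key=lambda ex: similarity(query, ex[0]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, shots: List[Example]) -> str:
    """Concatenate the selected demonstrations, then the unanswered query.

    Assumption: this "Code / Explanation" template is illustrative only.
    """
    parts = [f"Code:\n{code}\nExplanation:\n{expl}\n" for code, expl in shots]
    parts.append(f"Code:\n{query}\nExplanation:\n")
    return "\n".join(parts)


if __name__ == "__main__":
    # Toy training database invented for this sketch.
    train_db = [
        ("x = sorted(items)", "Sort the list items."),
        ("f = open(path)", "Open the file at path."),
        ("y = sorted(items, reverse=True)", "Sort items in descending order."),
    ]
    q = "z = sorted(items)"
    print(build_prompt(q, select_top_k(q, train_db, k=2)))
```

The resulting string would then be sent to a Code-LLM, which completes the final "Explanation:" slot. Keeping the ranking step separate from prompt construction means that comparing selection strategies only requires swapping the `similarity` argument.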