Quality evaluation of Tabby coding assistant using real source code snippets

Marta Borek, Robert Nowak

TL;DR

The paper addresses how to reliably evaluate AI-assisted code generation, focusing on TabbyML. It proposes a replicable pipeline that uses real Python implementations from The Algorithms database as ground truth, varying prompt prefixes, and a remote Tabby instance to generate completions, which are then assessed using static metrics and four text-based similarity measures. Key contributions include a pragmatic evaluation framework, a demonstration of how prefix length affects similarity and quality metrics, and a candid discussion of limitations such as the lack of functional clone detection and the use of the smallest model. The findings suggest Tabby can produce high-quality, contextually appropriate code in several scenarios, while highlighting the need for broader benchmarks and AST-based prompt design to capture functional equivalence. The framework and results provide a foundation for ongoing, open benchmarking of coding assistants in practical development settings.

Abstract

Large language models have become a popular tool in software development, where they provide coding assistance. Measuring the accuracy and reliability of the code such tools produce is challenging because they are driven by natural-language prompts. We propose a simple pipeline that uses state-of-the-art implementations of classic, widely used algorithms and data structures. We focus on measuring the quality of the TabbyML code assistant because of its open licence and its flexibility in the choice of language model. Our results, presented as cyclomatic complexity, Halstead's Bugs & Effort, and four text-based similarity matrices, depict the usability of TabbyML in coding assistance tasks.
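The core idea of the pipeline (cut a reference file at a prefix, let the assistant complete it, then compare against the original) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `split_at_prefix`, the fractional prefix length, and the hard-coded `completion` string are hypothetical stand-ins, and no Tabby server is contacted. Only the standard-library `difflib.SequenceMatcher` is used for scoring.

```python
from difflib import SequenceMatcher


def split_at_prefix(source: str, prefix_fraction: float) -> tuple[str, str]:
    """Split a reference file into (prefix, remainder) by a fraction of its lines.

    Hypothetical helper: the paper varies prefix length, but the exact
    cutting rule used there is an assumption here.
    """
    lines = source.splitlines(keepends=True)
    cut = max(1, int(len(lines) * prefix_fraction))
    return "".join(lines[:cut]), "".join(lines[cut:])


def similarity(a: str, b: str) -> float:
    """SequenceMatcher ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


# A tiny reference implementation standing in for a real source file.
reference = (
    "def gcd(a, b):\n"
    "    while b:\n"
    "        a, b = b, a % b\n"
    "    return a\n"
)

prefix, expected_rest = split_at_prefix(reference, 0.5)

# Placeholder for the assistant's response (assumption: in the real
# pipeline this text would come from a remote Tabby instance).
completion = "        a, b = b, a % b\n    return a\n"

# Two comparison modes, mirroring "whole file" vs "generated snippet only".
whole_file_score = similarity(reference, prefix + completion)
snippet_score = similarity(expected_rest, completion)
```

Scoring the whole file rewards the shared prefix, so it tends to rise with prefix length regardless of completion quality; scoring only the generated snippet isolates what the assistant actually contributed.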

Paper Structure

This paper contains 16 sections, 4 figures.

Figures (4)

  • Figure 1: Diagram of the testing pipeline's architecture; (1) Data Preprocessing and Selection, (2) Prefix Selection, (3) Server Interaction, (4) Tabby Completed Program, (5) Similarity Testing.
  • Figure 2: Averaged static quality evaluation outcomes. Green line in plots (a), (b) and (c) denotes the averaged score of reference files per metric.
  • Figure 3: Heatmaps for SequenceMatcher and Jaro-Winkler similarity algorithms outcomes, testing either on the whole file or on the generated snippet only. Outcomes in the [0, 1] range.
  • Figure 4: Heatmaps for Damerau-Levenshtein and Hamming distance algorithms outcomes, testing either on the whole file or on the generated snippet only. Outcomes in the [0, 1] range.
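The heatmaps above report distance-based measures on a [0, 1] scale. One common way to obtain such a scale is to divide the raw edit count by the longer string's length and subtract from 1. The sketch below shows that normalization for Hamming distance and for the restricted (optimal string alignment) variant of Damerau-Levenshtein. Both the normalization and the choice of variant are assumptions made for illustration; the listing does not state which library or scaling the authors used.

```python
def hamming_similarity(a: str, b: str) -> float:
    """1 - normalized Hamming distance.

    Hamming distance is defined for equal-length strings; as an assumption,
    every position missing from the shorter string counts as a mismatch.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return 1.0 - mismatches / longest


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Allowed edits: insertion, deletion, substitution, and transposition
    of two adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def osa_similarity(a: str, b: str) -> float:
    """1 - OSA distance divided by the longer length, giving a value in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest
```

Hamming compares strings position by position, so a single inserted character early in a completion shifts everything after it and drives the score down sharply, whereas the edit-distance measure charges only one operation for the same change.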