Quality evaluation of Tabby coding assistant using real source code snippets
Marta Borek, Robert Nowak
TL;DR
The paper addresses how to reliably evaluate AI-assisted code generation, focusing on TabbyML. It proposes a replicable pipeline that uses real Python implementations from The Algorithms database as ground truth, varying prompt prefixes, and a remote Tabby instance to generate completions, which are then assessed using static metrics and four text-based similarity measures. Key contributions include a pragmatic evaluation framework, a demonstration of how prefix length affects similarity and quality metrics, and a candid discussion of limitations such as the lack of functional clone detection and the use of the smallest model. The findings suggest Tabby can produce high-quality, contextually appropriate code in several scenarios, while highlighting the need for broader benchmarks and AST-based prompt design to capture functional equivalence. The framework and results provide a foundation for ongoing, open benchmarking of coding assistants in practical development settings.
Abstract
Large language models have become a popular tool in software development, providing coding assistance. Measuring the accuracy and reliability of the code such tools produce is challenging because they are driven by natural language prompts. We propose a simple pipeline that uses state-of-the-art implementations of classic and universal classes of algorithms and data structures. We focus on measuring the quality of the TabbyML code assistant because of its open licence and the flexibility it offers in the choice of language model. Our results, reported as cyclomatic complexity, Halstead's Bugs and Effort, and four text-based similarity metrics, illustrate the usability of TabbyML in coding assistance tasks.
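To make the idea of prefix-based evaluation concrete, the following is a minimal sketch rather than the pipeline used in this work. It assumes a reference implementation is cut at a line boundary, the leading part serves as the prompt prefix, and the remainder is the ground truth against which a completion is scored. The names `split_at_fraction`, `similarity`, and the sample `REFERENCE` snippet are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for the text-based similarity metrics, which are not named in this section. No Tabby instance is contacted; the candidate completion is a hand-written string.

```python
import difflib

# Hypothetical reference snippet standing in for a real implementation.
REFERENCE = (
    "def gcd(a, b):\n"
    "    while b:\n"
    "        a, b = b, a % b\n"
    "    return a\n"
)


def split_at_fraction(source: str, fraction: float) -> tuple[str, str]:
    """Split source at a line boundary into (prefix, remainder).

    The cut is clamped so that both parts contain at least one line.
    """
    lines = source.splitlines(keepends=True)
    cut = max(1, min(len(lines) - 1, round(len(lines) * fraction)))
    return "".join(lines[:cut]), "".join(lines[cut:])


def similarity(candidate: str, truth: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in metric only."""
    return difflib.SequenceMatcher(None, candidate, truth).ratio()


if __name__ == "__main__":
    # A hand-written candidate, used in place of a real assistant's output.
    candidate = "    return a\n"
    for fraction in (0.25, 0.5, 0.75):
        prefix, truth = split_at_fraction(REFERENCE, fraction)
        score = similarity(candidate, truth)
        print(f"prefix lines: {prefix.count(chr(10))}, similarity: {score:.2f}")
```

In a real run the candidate would come from the assistant queried with `prefix`, and the same loop would show how the score changes as more of the reference is revealed.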
