CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow

Nathanaël Beau, Benoît Crabbé

TL;DR

This paper introduces a novel dataset tailored for code generation and aimed at aiding developers in common tasks. It is designed both for model finetuning and for standalone evaluation of models' strengths and weaknesses in specific coding tasks.

Abstract

We introduce a novel dataset tailored for code generation, aimed at aiding developers in common tasks. Each example in the dataset includes a clarified intent, an associated code snippet, and an average of three related unit tests. The dataset covers a range of libraries such as Pandas, Numpy, and Regex, along with more than 70 standard libraries, in Python code derived from Stack Overflow. Comprising 3,409 examples crafted by Python experts, it is designed for both model finetuning and standalone evaluation. To complement unit-test evaluation, we categorize the examples to enable a more fine-grained analysis, improving the understanding of models' strengths and weaknesses in specific coding tasks. The examples have been refined to reduce data contamination, a process confirmed by the performance of three leading models: Mistral 7B, CodeLLaMa 13B, and Starcoder 15B. We further investigate data contamination by testing GPT-4's performance on a part of our dataset. The benchmark can be accessed at https://github.com/NathanaelBeau/CodeInsight.
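The structure described in the abstract (a clarified intent, a code snippet, and a handful of unit tests) can be illustrated with a minimal sketch. The class name `Example`, its field names, the convention that the solution is a function called `test`, and the sample record are all hypothetical and chosen for illustration. They are not taken from the dataset, whose actual schema should be checked in the repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """Hypothetical record layout: not the dataset's actual schema."""
    intent: str          # clarified natural-language description of the task
    reference_code: str  # code snippet that solves the task
    tests: List[str]     # unit tests, written here as plain assert statements


def passes_all_tests(candidate_code: str, tests: List[str]) -> bool:
    """Run candidate code, then each test, in one fresh namespace.

    Returns False if the code or any test raises an exception.
    Note: exec runs arbitrary code, so untrusted model output
    should be executed in a sandbox in practice.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False
    return True


# Invented sample record (assumption: the solution is a function named `test`).
example = Example(
    intent="Return the elements of list `lst` in reverse order",
    reference_code="def test(lst):\n    return lst[::-1]\n",
    tests=[
        "assert test([1, 2, 3]) == [3, 2, 1]",
        "assert test([]) == []",
        "assert test(['a']) == ['a']",
    ],
)

reference_ok = passes_all_tests(example.reference_code, example.tests)
wrong_ok = passes_all_tests("def test(lst):\n    return lst\n", example.tests)
```

Under these assumptions, a generated snippet counts as correct only if every associated test passes, which is the kind of execution-based check that unit tests make possible.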

Paper Structure (41 sections, 2 figures, 14 tables)

Figures (2)

  • Figure 1: Curation Workflow from Stack Overflow to Dataset - The filtering phase (left) screens questions based on usefulness, code extractability, alignment, and testability, with one example advancing. The labeling phase (right) details the annotation of this example: extracting and standardizing code, refining the question for clarity with normalized terms, and developing unit tests to validate the function.
  • Figure 2: Ratio of positive (belonging to a specific category) to negative (not belonging to the category) examples for each of the 10 distinct categories, in terms of item count, average code lines, and AST depth. Detailed statistical data supporting this analysis can be found in the appendix (app:stats-dataset).
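
The two code-level measures named in the Figure 2 caption, code lines and AST depth, can be computed with Python's standard `ast` module. The sketch below is one possible convention: it counts non-blank lines and measures depth as the number of nodes on the longest root-to-leaf path. The paper's exact definitions are not given in this excerpt, so the function names and counting rules here are assumptions.

```python
import ast


def ast_depth(node: ast.AST) -> int:
    """Number of nodes on the longest path from `node` down to a leaf."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return 1
    return 1 + max(ast_depth(child) for child in children)


def snippet_stats(code: str) -> dict:
    """Count non-blank lines and compute the AST depth of a snippet."""
    tree = ast.parse(code)
    non_blank = [line for line in code.splitlines() if line.strip()]
    return {"lines": len(non_blank), "ast_depth": ast_depth(tree)}


flat = snippet_stats("x = 1")
nested = snippet_stats("x = f(g(1))")
```

With this convention, a nested call such as `f(g(1))` yields a deeper tree than a plain assignment, which is the sense in which AST depth serves as a rough proxy for structural complexity.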