CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow

Nathanaël Beau; Benoît Crabbé

TL;DR

This paper introduces a novel code generation dataset aimed at helping developers with common tasks, designed both for model finetuning and for standalone evaluation of models' strengths and weaknesses on specific coding tasks.

Abstract

We introduce a novel dataset tailored for code generation, aimed at aiding developers in common tasks. Examples in the dataset include a clarified intent, an associated code snippet, and an average of three related unit tests. The dataset covers a range of libraries such as Pandas, Numpy, and Regex, along with more than 70 standard libraries, in Python code derived from Stack Overflow. Comprising 3,409 examples crafted by Python experts, it is designed for both model finetuning and standalone evaluation. To complement unit-test evaluation, we categorize the examples to enable more fine-grained analysis, improving the understanding of models' strengths and weaknesses in specific coding tasks. The examples have been refined to reduce data contamination, a process confirmed by the performance of three leading models: Mistral 7B, CodeLLaMa 13B, and Starcoder 15B. We further investigate data contamination by testing GPT-4 performance on a part of our dataset. The benchmark can be accessed at https://github.com/NathanaelBeau/CodeInsight.
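As a rough illustration of the example structure described in the abstract (an intent, a code snippet, and about three unit tests), the sketch below builds one hypothetical record and runs its tests. The record is invented for illustration and is not taken from the dataset; the field names `intent`, `code`, and `tests`, the function name `solution`, and the helper `run_example` are all assumptions rather than the benchmark's actual schema or tooling.

```python
# Hypothetical record: the field names and contents are illustrative
# assumptions, not the actual CodeInsight schema.
example = {
    "intent": "Remove all digits from string `s`",
    "code": (
        "import re\n"
        "\n"
        "def solution(s):\n"
        "    return re.sub(r'[0-9]', '', s)\n"
    ),
    "tests": [
        "assert solution('abc123') == 'abc'",
        "assert solution('2024') == ''",
        "assert solution('no digits') == 'no digits'",
    ],
}


def run_example(record):
    """Execute the snippet, then each unit test; return how many tests pass."""
    namespace = {}
    exec(record["code"], namespace)  # defines `solution` inside `namespace`
    passed = 0
    for test in record["tests"]:
        try:
            exec(test, namespace)
            passed += 1
        except Exception:
            # A failed assertion or a runtime error counts as a failed test.
            pass
    return passed


if __name__ == "__main__":
    print(f"{run_example(example)}/{len(example['tests'])} tests passed")
```

Scoring a model on such a record would amount to replacing the `code` field with the model's generated function and counting the tests that pass. Running untrusted generated code with `exec` is unsafe outside a sandbox, so a real harness would isolate execution.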

Paper Structure (41 sections, 2 figures, 14 tables)

Introduction
Dataset Construction
Data Sources
Data Filtering
Authenticity of Developer Inquiries
Direct Extractability of Code
Natural Language and Code Alignment
Executable Code Samples
Data Annotation
Task 1 - Code Extraction from Stack Overflow
Task 2 - Refinement for Natural Language and Code Consistency
Task 3 - Development of Function Test Cases
Dataset Statistics
Packages Statistics
Labels Statistics
...and 26 more sections

Figures (2)

Figure 1: Curation Workflow from Stack Overflow to Dataset - The filtering phase (left) screens questions based on usefulness, code extractability, alignment, and testability, with one example advancing. The labeling phase (right) details the annotation of this example: extracting and standardizing code, refining the question for clarity with normalized terms, and developing unit tests to validate the function.
Figure 2: Ratio of positive (belonging to a specific category) to negative (not belonging to the category) examples for each of the 10 distinct categories, in terms of item count, average code lines, and AST depth. Detailed statistical data supporting this analysis can be found in Appendix \ref{app:stats-dataset}.
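The two code-level statistics named in the Figure 2 caption, code lines and AST depth, can be approximated with Python's standard `ast` module. The sketch below is one plausible way to compute them, not the authors' measurement code: the helper names `ast_depth` and `code_stats` are hypothetical, and the conventions used here (counting only non-blank lines, and treating a leaf node as depth 1) are assumptions that this page does not specify.

```python
import ast


def ast_depth(node):
    """Depth of the AST rooted at `node`; a node without children has depth 1."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return 1
    return 1 + max(ast_depth(child) for child in children)


def code_stats(source):
    """Return the non-blank line count and AST depth of a Python snippet."""
    tree = ast.parse(source)
    non_blank = [line for line in source.splitlines() if line.strip()]
    return {"lines": len(non_blank), "ast_depth": ast_depth(tree)}


if __name__ == "__main__":
    flat = "x = 1"
    nested = "x = f(g(1))"
    print(code_stats(flat))
    print(code_stats(nested))
```

Averaging these values separately over the positive and negative examples of a category would give per-category ratios of the kind the figure reports.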
