COFO: COdeFOrces dataset for Program Classification, Recognition and Tagging

Kuldeep Gautam, S. VenkataKeerthy, Ramakrishna Upadrasta

TL;DR

COFO introduces a large-scale, multi-language dataset of 369K programs across 809 Codeforces problems to advance program classification, code tagging, and NLP-based code comprehension. It describes a Selenium-BeautifulSoup-based scraping pipeline that uses the Codeforces API to collect problem metadata, problem specifications, test cases, and accepted submissions, organized in a per-problem directory structure with language-specific submissions. Key contributions include the first large, language-diverse benchmark for program classification and tagging, with detailed statistics on languages, classes, test cases, and code tags, enabling cross-language analysis and code clone detection. The dataset supports practical ML research in software engineering and code understanding, and the authors provide an open-source toolchain to reproduce and extend the collection process.

Abstract

In recent years, advances in computer science have helped programmers create innovative, user-friendly, real-time software. As more software is written and interest in learning to program grows, a large collection of source code has become available on the web. This collection, known as Big Code, can serve as data for machine learning applications that address software engineering problems. In this paper, we present COFO, a dataset of 809 classes/problems with a total of 369K programs written in C, C++, Java, and Python, along with metadata such as code tags, problem specifications, and input-output specifications. COFO was scraped from the openly available Codeforces website using a scraper built with Python, Selenium, and BeautifulSoup. We envision that this dataset will be useful for machine learning problems such as program classification/recognition, tagging, predicting program properties, and code comprehension.
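As a rough illustration of the per-problem, language-specific organization mentioned in the TL;DR, the sketch below builds file paths for submissions and counts them per language. The layout `<root>/<problem>/<language>/<submission><ext>`, the `EXTENSIONS` mapping, the function names, and the sample problem and submission identifiers are all assumptions made for this example; the actual COFO directory layout may differ.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical mapping from language name to file extension (an assumption,
# not taken from the COFO toolchain).
EXTENSIONS = {"C": ".c", "C++": ".cpp", "Java": ".java", "Python": ".py"}


def submission_path(root, problem_id, language, submission_id):
    """Build <root>/<problem_id>/<language>/<submission_id><ext> (assumed layout)."""
    ext = EXTENSIONS[language]  # KeyError for a language outside the four listed
    return PurePosixPath(root) / problem_id / language / f"{submission_id}{ext}"


def language_counts(paths):
    """Count submissions per language, reading the language from the parent folder."""
    return Counter(p.parent.name for p in paths)


# Made-up problem and submission identifiers, for illustration only.
paths = [
    submission_path("COFO", "1A", "C++", 1001),
    submission_path("COFO", "1A", "Python", 1002),
    submission_path("COFO", "4A", "C++", 1003),
]
counts = language_counts(paths)
```

Keeping the problem (the class label) and the language as separate path components means either one can be read back from a file's location, which is convenient when assembling classification or cross-language splits.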

Paper Structure

This paper contains 14 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Schematic of the Scraping Process and the Directory Structure
  • Figure 2: Dataset Distribution