Towards a Classification of Open-Source ML Models and Datasets for Software Engineering

Alexandra González, Xavier Franch, David Lo, Silverio Martínez-Fernández

TL;DR

Open-source PTMs and datasets for SE on Hugging Face focus primarily on software development and give limited attention to software management, which underscores the need for broader task coverage to enhance the integration of ML within SE practices.

Abstract

Background: Open-Source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. Aims: We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time. Method: We conducted a repository mining study. We started with a systematically gathered database of PTMs and datasets from the HF API. Our selection was refined by analyzing model and dataset cards and metadata, such as tags, and confirming SE relevance using Gemini 1.5 Pro. All analyses are replicable, with a publicly accessible replication package. Results: The most common SE task among PTMs and datasets is code generation, with a primary focus on software development and limited attention to software management. Popular PTMs and datasets mainly target software development. Among ML tasks, text generation is the most common in SE PTMs and datasets. There has been a marked increase in PTMs for SE since 2023 Q2. Conclusions: This study underscores the need for broader task coverage to enhance the integration of ML within SE practices.
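The tag-based refinement step described in the Method can be illustrated with a minimal sketch. It is not the authors' pipeline: the keyword table, the record layout, and the function names `classify` and `count_activities` are all hypothetical, and the real study additionally reads model and dataset cards and confirms SE relevance with Gemini 1.5 Pro, which is not reproduced here. The records are hard-coded stand-ins for metadata that would come from the HF API.

```python
from collections import Counter

# Hypothetical keyword table: SE activity -> tags treated as evidence for it.
# Illustrative only; this is NOT the taxonomy used in the paper.
SE_KEYWORDS = {
    "software development": {"code", "code-generation", "program-synthesis"},
    "software management": {"effort-estimation", "issue-triage"},
}


def classify(tags):
    """Return the sorted SE activities whose keywords overlap the given tags."""
    tagset = {t.lower() for t in tags}
    return sorted(
        activity
        for activity, keywords in SE_KEYWORDS.items()
        if tagset & keywords
    )


def count_activities(records):
    """Count how many records fall under each SE activity."""
    counts = Counter()
    for record in records:
        for activity in classify(record.get("tags", [])):
            counts[activity] += 1
    return counts


# Hard-coded stand-ins for repository metadata (assumed shape: id + tags).
records = [
    {"id": "example/model-a", "tags": ["text-generation", "code"]},
    {"id": "example/model-b", "tags": ["image-classification"]},
    {"id": "example/model-c", "tags": ["Issue-Triage", "text-classification"]},
]

if __name__ == "__main__":
    for activity, n in sorted(count_activities(records).items()):
        print(f"{activity}: {n}")
```

Records without any matching tag (such as `example/model-b`) are simply left unclassified, which mirrors the idea that only SE-relevant resources enter the analysis; a keyword filter alone would likely be too coarse in practice, which is presumably why the study adds a card-based and LLM-based confirmation step.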

Paper Structure

This paper contains 19 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Data collection and preparation pipeline.
  • Figure 2: PTMs associated with each SE task and SE activity.
  • Figure 3: Datasets associated with each SE task and SE activity.
  • Figure 4: Top 3 most popular PTMs per SE activity.
  • Figure 5: Top 3 most popular datasets per SE activity.
  • ...and 3 more figures