Automated categorization of pre-trained models for software engineering: A case study with a Hugging Face dataset

Claudio Di Sipio; Riccardo Rubei; Juri Di Rocco; Davide Di Ruscio; Phuong T. Nguyen

TL;DR

This paper tackles the challenge of selecting suitable pre-trained models for software engineering tasks by proposing a semi-automated mapping between Hugging Face pipeline tags and SE tasks. It leverages the HF data dump, builds an SE-task taxonomy, and uses a similarity-based algorithm with a threshold $T=0.8$ to link PTMs to SE tasks, evaluated through text classifiers on model-card data. Key contributions include an initial PTM→SE-task mapping, an empirical evaluation showing model cards can support automatic categorization (with SVC outperforming CNB by ~10%), and a replication package for future work. The approach lays groundwork for SE-oriented recommender systems (RSSEs) and suggests extending the mapping to other PTM repositories to enhance practical PTM discovery for developers.

Abstract

Software engineering (SE) activities have been revolutionized by the advent of pre-trained models (PTMs), defined as large machine learning (ML) models that can be fine-tuned to perform specific SE tasks. However, users with limited expertise may need help selecting the appropriate model for their current task. To tackle the issue, the Hugging Face (HF) platform simplifies the use of PTMs by collecting, storing, and curating several models. Nevertheless, the platform currently lacks a comprehensive categorization of PTMs designed specifically for SE; the existing tags are better suited to generic ML categories. This paper introduces an approach to address this gap by enabling the automatic classification of PTMs for SE tasks. First, we use a public dump of HF to extract PTM information, including model documentation and associated tags. Then, we employ a semi-automated method to identify SE tasks and their corresponding PTMs from existing literature. The approach involves creating an initial mapping between HF tags and specific SE tasks, using a similarity-based strategy to identify PTMs with relevant tags. The evaluation shows that model cards are informative enough to classify PTMs by their pipeline tag. Moreover, we provide a mapping between SE tasks and stored PTMs by relying on model names.
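
The similarity-based linking of HF tags to SE tasks described above can be illustrated with a minimal sketch. It is not the authors' algorithm: the string similarity used here (Python's `difflib.SequenceMatcher`) is an assumption, since this page does not state which measure the paper uses, and the tag and task lists are small hypothetical examples. Only the threshold $T=0.8$ comes from the TL;DR.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # T = 0.8, the value reported in the TL;DR


def normalize(label: str) -> str:
    """Lower-case a label and treat hyphens as spaces."""
    return label.lower().replace("-", " ")


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1].

    Assumption: SequenceMatcher is a stand-in; the paper's actual
    similarity measure is not specified on this page.
    """
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def map_tags_to_tasks(pipeline_tags, se_tasks, threshold=THRESHOLD):
    """Link each tag to the SE tasks whose similarity reaches the threshold.

    Returns {tag: [(task, score), ...]} sorted by descending score.
    Tags with no match at or above the threshold are left out.
    """
    mapping = {}
    for tag in pipeline_tags:
        scored = [(task, similarity(tag, task)) for task in se_tasks]
        kept = [(task, score) for task, score in scored if score >= threshold]
        if kept:
            mapping[tag] = sorted(kept, key=lambda pair: pair[1], reverse=True)
    return mapping


# Hypothetical inputs, chosen only for illustration.
pipeline_tags = ["text-generation", "summarization", "image-classification"]
se_tasks = ["code summarization", "text generation", "defect prediction"]

mapping = map_tags_to_tasks(pipeline_tags, se_tasks)
for tag, matches in mapping.items():
    print(tag, "->", matches)
```

In this toy run, `image-classification` has no SE task at or above the threshold and is therefore dropped, which mirrors the idea that generic ML tags do not always correspond to an SE task. A lower threshold would admit more, and noisier, links.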

Paper Structure (12 sections, 6 figures, 3 tables, 1 algorithm)

Introduction
Motivation and Background
Overview of the PTM reuse workflow
The Hugging Face model repository
Proposed approach
HF data gathering
SE task filtering
Mapping phase
Preliminary evaluation
Threats to validity
Related Work
Conclusion and Future work

Figures (6)

Figure 1: The Hugging Face PTM reuse-oriented capabilities.
Figure 2: The proposed mapping approach.
Figure 3: The performed process to elicit SE tasks.
Figure 4: The identified SE macro tasks.
