Cataloguing Hugging Face Models to Software Engineering Activities: Automation and Findings
Alexandra González, Xavier Franch, David Lo, Silverio Martínez-Fernández
TL;DR
The authors address the lack of SE-specific categorization for Hugging Face PTMs by developing a five-phase pipeline that maps 2,205 PTMs to a detailed 147-task SDLC taxonomy, validated with LLM assistance. They find a strong bias toward software implementation and code-generation tasks, alongside widespread documentation and benchmarking gaps that hinder reuse. The work provides a publicly accessible replication package and an interactive sampling tool to facilitate SE practitioners’ PTM selection, and highlights opportunities to extend the taxonomy to other registries and integrate automated recommendations into SE pipelines. Overall, the study enables automated cataloguing of large-scale PTM resources and clarifies where the SE community needs better coverage and reporting to support reliable reuse.
Abstract
Context: Open-source Pre-Trained Models (PTMs) provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs to support the reliable identification and reuse of models for SE. Objective: To address this gap, we derive a taxonomy encompassing 147 SE tasks and apply an SE-oriented classification to PTMs in a popular open-source ML repository, Hugging Face (HF). Method: Our repository mining study followed a five-phase pipeline: (i) identification of SE tasks from the literature; (ii) collection of PTM data from the HF API, including model card descriptions and metadata, and the abstracts of the associated arXiv papers; (iii) text processing to ensure consistency; (iv) a two-phase validation of SE relevance, involving humans and LLM assistance, supported by five pilot studies with human annotators and a generalization test; and (v) data analysis. This process yielded a curated catalogue of 2,205 SE PTMs. Results: We find that most SE PTMs target code generation and coding, emphasizing implementation over early or late development stages. In terms of ML tasks, text generation dominates within SE PTMs. Notably, the number of SE PTMs has increased markedly since 2023 Q2, while evaluation remains limited: only 9.6% report benchmark results, mostly scoring below 50%. Conclusions: Our catalogue reveals documentation and transparency gaps, highlights imbalances across SDLC phases, and provides a foundation for automated SE scenarios, such as the sampling and selection of suitable PTMs.
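To make the text-processing and analysis phases more concrete, the following is a minimal offline sketch, not the authors' implementation. It assumes a simple normalization step (Unicode normalization, lowercasing, URL removal, whitespace collapsing), a keyword lookup used only as a first-pass candidate filter (the paper's actual relevance validation relies on human annotators and LLM assistance), and a helper that computes the share of models reporting benchmark results. The keyword table, the record layout, and all function names are hypothetical.

```python
import re
import unicodedata
from typing import Dict, List

# Hypothetical keyword table: the paper's taxonomy covers 147 SE tasks,
# which are not listed in the abstract; these three are illustrative only.
SE_TASK_KEYWORDS: Dict[str, List[str]] = {
    "code generation": ["code generation", "generate code"],
    "code summarization": ["code summarization", "summarize code"],
    "bug detection": ["bug detection", "defect detection"],
}


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, lowercase, drop URLs, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def candidate_tasks(card_text: str) -> List[str]:
    """Return SE tasks whose keywords occur in the normalized text.

    Assumption: this is only a cheap candidate filter, standing in for
    the human and LLM-assisted validation described in the abstract.
    """
    clean = normalize_text(card_text)
    return sorted(
        task
        for task, keywords in SE_TASK_KEYWORDS.items()
        if any(keyword in clean for keyword in keywords)
    )


def benchmark_share(models: List[dict]) -> float:
    """Fraction of model records with a non-empty 'benchmarks' field."""
    if not models:
        return 0.0
    reporting = sum(1 for model in models if model.get("benchmarks"))
    return reporting / len(models)
```

Applied to a full catalogue, a helper like `benchmark_share` would produce the kind of figure the abstract reports (9.6% of SE PTMs with benchmark results), though the record format here is an assumption rather than the schema of the replication package.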
