Can I Solve It? Identifying APIs Required to Complete OSS Task

Fabio Santos, Igor Wiese, Bianca Trinkenreich, Igor Steinmacher, Anita Sarma, Marco Gerosa

TL;DR

This study addresses the challenge of guiding OSS contributors to suitable tasks by automatically labeling issues with API-domain labels. It presents a three-phase methodology: mining JabRef to build ground-truth API-domain labels, constructing and evaluating multi-label TF-IDF-based classifiers (with Random Forest performing best), and conducting a developer study to assess label relevance. Results show that the classifier can predict API-domain labels with precision around 0.76 and recall around 0.75, and that participants found API-domain labels significantly more useful for task selection, especially industry practitioners and experienced developers. The work demonstrates practical potential for automating skill-directed task matching, provides replication data, and outlines future directions, including validation on more projects and richer embedding-based techniques.

Abstract

Open Source Software projects add labels to open issues to help contributors choose tasks. However, manually labeling issues is time-consuming and error-prone. Current automatic approaches for creating labels are mostly limited to classifying issues as bug or non-bug. In this paper, we investigate the feasibility and relevance of labeling issues with the domain of the APIs required to complete the tasks. We leverage the issues' descriptions and the project history to build prediction models, which resulted in precision up to 82% and recall up to 97.8%. We also ran a user study (n=74) to assess these labels' relevance to potential contributors. The results show that the labels were useful to participants in choosing tasks, and the API-domain labels were selected more often than the existing architecture-based labels. Our results can inspire the creation of tools to automatically label issues, helping developers find tasks that better match their skills.
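
Because each issue can carry several API-domain labels at once, precision and recall are computed over label sets, not single classes. The following is a minimal sketch of one common way to do that, micro-averaging over all (issue, label) pairs. The averaging scheme is an assumption, since this summary does not state which one the paper uses, and the function name, label names, and sample data are hypothetical illustrations, not taken from JabRef.

```python
from typing import Iterable, Set, Tuple


def micro_precision_recall(
    true_labels: Iterable[Set[str]],
    predicted_labels: Iterable[Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over per-issue label sets."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        tp += len(truth & predicted)   # labels predicted and correct
        fp += len(predicted - truth)   # labels predicted but wrong
        fn += len(truth - predicted)   # labels missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and predictions for three issues.
truth = [{"UI", "IO"}, {"Network"}, {"UI", "Logging"}]
predicted = [{"UI"}, {"Network", "IO"}, {"UI", "Logging"}]

precision, recall = micro_precision_recall(truth, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy example, four labels are predicted correctly, one is spurious, and one is missed, so both scores are 4/5. Micro-averaging weights frequent labels more heavily; a macro average over labels would treat rare API domains equally.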

Paper Structure

This paper contains 22 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Research Design
  • Figure 2: Number of labels per issue
  • Figure 3: Survey question about the regions relevance
  • Figure 4: Comparison between the unigram model and n-grams models
  • Figure 5: Comparison between the baseline model and other machine learning algorithms
  • ...and 2 more figures