Labeling questions inside issue trackers
Aidin Rasti
TL;DR
This work addresses spam-like questions in open-source issue trackers by building a binary classifier that separates questions from non-questions. It constructs a large labeled dataset from the RapidRelease GitHub corpus, applies heavy text cleaning and English-language filtering, and compares two state-of-the-art sentence-embedding methods, Sentence-BERT and Universal Sentence Encoder (USE), across multiple classifiers. Logistic Regression on USE embeddings performs best, at about 81.7% accuracy, which suggests that automatic question filtering is feasible and could reduce the triage burden on maintainers. The study also discusses limitations, possible integration into a pipeline with multi-class defect labeling, and avenues for future improvement such as hyperparameter tuning and neural architectures.
Abstract
One of the problems faced by maintainers of popular open-source software is the triage of newly reported issues. Many of the issues submitted to issue trackers are questions: people ask about their problem on the issue tracker instead of using a proper Q&A website such as Stack Overflow. This may seem insignificant, but for large projects with thousands of users it amounts to spamming of the issue tracker. Reading and labeling issues manually is a serious, time-consuming task, and these unrelated questions add to the burden. In fact, maintainers often explicitly ask users not to submit questions to the issue tracker. To address this problem, we first leveraged dozens of patterns to clean the text of issues, removing noise such as logs, stack traces, environment variables, and error messages. Second, we implemented a classification-based approach to automatically label unrelated questions. Empirical evaluation on a dataset of more than 102,000 records shows that our approach can label questions with an accuracy of over 81%.
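To illustrate the pattern-based cleaning step, the following is a minimal sketch in Python. The abstract does not list the actual patterns, so the regular expressions below, the `NOISE_PATTERNS` list, and the `clean_issue_text` helper are all hypothetical stand-ins for the kinds of noise named above (logs, stack traces, environment variables, error messages), not the patterns used in this work.

```python
import re

# Hypothetical noise patterns (assumptions for illustration only);
# the real pattern list used in the study is not given in the abstract.
NOISE_PATTERNS = [
    re.compile(r"```.*?```", re.DOTALL),                                   # fenced code blocks
    re.compile(r"^Traceback \(most recent call last\):.*$", re.MULTILINE),  # Python traceback header
    re.compile(r'^\s+File ".*", line \d+.*$', re.MULTILINE),               # Python stack frames
    re.compile(r"^\s*at [\w.$<>]+\(.*\)\s*$", re.MULTILINE),               # Java-style stack frames
    re.compile(r"^\w*(Error|Exception): .*$", re.MULTILINE),               # error messages
    re.compile(r"^[A-Z_][A-Z0-9_]*=.*$", re.MULTILINE),                    # environment variables
    re.compile(r"^\[?\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}.*$", re.MULTILINE),  # timestamped log lines
    re.compile(r"https?://\S+"),                                           # URLs
]


def clean_issue_text(text: str) -> str:
    """Remove noisy spans from an issue body and collapse whitespace."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = (
        "How do I set the timeout?\n"
        "PATH=/usr/bin\n"
        "Traceback (most recent call last):\n"
        '  File "app.py", line 3, in <module>\n'
        "ValueError: bad value\n"
        "2020-01-02 10:11:12 INFO started\n"
        "Thanks!"
    )
    print(clean_issue_text(sample))
```

Only the natural-language part of the issue survives the cleaning, and that cleaned text is what would then be embedded and passed to the classifier.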
