Towards Refining Developer Questions using LLM-Based Named Entity Recognition for Developer Chatroom Conversations
Pouya Fathollahzadeh, Mariam El Mezouar, Hao Li, Ying Zou, Ahmed E. Hassan
TL;DR
This work tackles the problem of imprecise questions in software developer chatrooms by introducing SENIR, an LLM-based framework for software-specific NER, intent detection, and resolution classification. It uses a Mixtral 8x7B model to label 29,243 DISCO conversations, creates a gold-standard dataset of 400 conversations for evaluation, and builds a 20-feature model that predicts question resolution, achieving an AUC of around 0.75 across multiple sampling and validation schemes. The study finds that precise entities (e.g., Library Function, Library Class) and positive sentiment improve resolution, while excessive URLs and late posting hinder it, and Chi-Square analyses show that resolution rates differ significantly across intents. The results support practical guidance for developers and chat platforms, such as structured templates and automated tagging to improve clarity and responsiveness, and suggest that SENIR may apply more broadly to other SE datasets and retrieval-based chatbots.
Abstract
In software engineering chatrooms, communication is often hindered by imprecise questions that cannot be answered. Recognizing key entities can be essential for improving question clarity and facilitating better exchange. However, existing research using natural language processing techniques often overlooks these software-specific nuances. In this paper, we introduce Software-specific Named Entity Recognition, Intent Detection, and Resolution Classification (SENIR), a labeling approach that leverages a Large Language Model to annotate entities, intents, and resolution status in developer chatroom conversations. To offer quantitative guidance for improving question clarity and resolvability, we build a resolution prediction model that leverages SENIR's entity and intent labels along with additional predictive features. We evaluate SENIR on the DISCO dataset using a subset of annotated chatroom dialogues. SENIR achieves an 86% F-score for entity recognition, a 71% F-score for intent detection, and an 89% F-score for resolution status classification. Furthermore, our resolution prediction model, tested with various sampling strategies (random undersampling and oversampling with SMOTE) and evaluation methods (5-fold cross-validation, 10-fold cross-validation, and bootstrapping), demonstrates AUC values ranging from 0.7 to 0.8. Key factors influencing resolution include positive sentiment and entities such as Programming Language and User Variable across multiple intents, while diagnostic entities are more relevant in error-related questions. Moreover, resolution rates vary significantly by intent: questions about API Usage and API Change achieve higher resolution rates, whereas Discrepancy and Review have lower resolution rates. A Chi-Square analysis confirms the statistical significance of these differences.
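To make the Chi-Square comparison of resolution rates across intents concrete, the following is a minimal sketch of a test of independence on an intent-by-outcome contingency table. It is not the authors' analysis code: the function name `chi_square_independence`, the table layout, and all counts are assumptions made for illustration, and the counts are invented rather than taken from the DISCO data.

```python
# Illustrative sketch only: the counts below are invented, not taken from the paper.

# Upper 5% critical values of the chi-square distribution, by degrees of freedom.
CRITICAL_VALUES_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for an intent-by-outcome table.

    `table` maps each intent name to a (resolved, unresolved) pair of counts.
    """
    n_cols = 2
    row_totals = {intent: sum(counts) for intent, counts in table.items()}
    col_totals = [sum(counts[j] for counts in table.values()) for j in range(n_cols)]
    grand_total = sum(col_totals)

    statistic = 0.0
    for intent, counts in table.items():
        for j, observed in enumerate(counts):
            # Expected count if resolution were independent of intent.
            expected = row_totals[intent] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (n_cols - 1)
    return statistic, dof


if __name__ == "__main__":
    # Hypothetical (resolved, unresolved) counts per intent.
    hypothetical_counts = {
        "API Usage": (80, 20),
        "API Change": (70, 30),
        "Discrepancy": (50, 50),
        "Review": (45, 55),
    }
    statistic, dof = chi_square_independence(hypothetical_counts)
    significant = statistic > CRITICAL_VALUES_05[dof]
    print(f"chi-square = {statistic:.3f}, dof = {dof}, significant at 0.05: {significant}")
```

A statistic above the critical value for the given degrees of freedom indicates that resolution rates differ across intents by more than chance would explain at the 5% level. In practice, a statistics library that also reports an exact p-value would likely be preferable to the hard-coded critical values used here.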
