What Characteristics Make ChatGPT Effective for Software Issue Resolution? An Empirical Study of Task, Project, and Conversational Signals in GitHub Issues
Ramtin Ehsani, Sakshi Pathak, Esteban Parra, Sonia Haiduc, Preetha Chatterjee
TL;DR
This study empirically analyzes 686 developer-ChatGPT conversations embedded in GitHub issue threads to understand what makes such conversations helpful for issue resolution. By combining qualitative annotation with quantitative signals across conversational, project, and issue dimensions, it identifies that only 62% of conversations are helpful, with code generation and API/tool recommendations showing the strongest performance while code explanations lag. Helpful interactions are characterized by concise, well-structured prompts and strong semantic alignment and project context, while larger, more active projects and experienced developers derive greater benefit. The findings offer guidance on interaction strategies, prompt-design tools, and potential fine-tuning or retrieval-augmented approaches to improve LLM-assisted issue resolution in software engineering.
Abstract
Conversational large-language models are extensively used for issue resolution tasks. However, not all developer-LLM conversations are useful for effective issue resolution. In this paper, we analyze 686 developer-ChatGPT conversations shared within GitHub issue threads to identify characteristics that make these conversations effective for issue resolution. First, we analyze the conversations and their corresponding issues to distinguish helpful from unhelpful conversations. We begin by categorizing the types of tasks developers seek help with to better understand the scenarios in which ChatGPT is most effective. Next, we examine a wide range of conversational, project, and issue-related metrics to uncover factors associated with helpful conversations. Finally, we identify common deficiencies in unhelpful ChatGPT responses to highlight areas that could inform the design of more effective developer-facing tools. We found that only 62% of the ChatGPT conversations were helpful for successful issue resolution. ChatGPT is most effective for code generation and tools/libraries/APIs recommendations, but struggles with code explanations. Helpful conversations tend to be shorter, more readable, and exhibit stronger semantic and linguistic alignment. Larger, more popular projects and more experienced developers benefit more from ChatGPT. At the issue level, ChatGPT performs best on simpler problems with limited developer activity and faster resolution, typically well-scoped tasks like compilation errors. The most common deficiencies in unhelpful ChatGPT responses include incorrect information and lack of comprehensiveness. Our findings have wide implications including guiding developers on effective interaction strategies for issue resolution, informing the development of tools or frameworks to support optimal prompt design, and providing insights on fine-tuning LLMs for issue resolution tasks.
