Beyond the Comfort Zone: Emerging Solutions to Overcome Challenges in Integrating LLMs into Software Products
Nadia Nahar, Christian Kästner, Jenna Butler, Chris Parnin, Thomas Zimmermann, Christian Bird
TL;DR
The paper addresses the challenge of integrating large language models (LLMs) into software products by identifying how they disrupt traditional software engineering and quality assurance (QA). Using a sequential mixed-methods design with 26 interviews and 332 survey responses from Microsoft teams, it uncovers 19 emerging QA solutions (and 7 related to development/prompting) that practitioners are adopting to manage LLM-specific complexities. Key contributions include validating known challenges and detailing practical solutions such as combining qualitative and quantitative evaluation, employing LLMs as validators, implementing extensive guardrails, automating offline evaluation, and adopting canary releases and A/B testing. The findings offer actionable guidance for the development and evaluation of LLM-based products and set the stage for evaluating these practices in other contexts. The work highlights the need for formal assessment of how effective these solutions are and invites broader experimentation across organizations and product domains.
Abstract
Large Language Models (LLMs) are increasingly embedded into software products across diverse industries, enhancing user experiences but at the same time introducing numerous challenges for developers. Unique characteristics of LLMs force developers, who are accustomed to traditional software development and evaluation, out of their comfort zones as the LLM components shatter standard assumptions about software systems. This study explores the emerging solutions that software developers are adopting to navigate the challenges they encounter. Using a mixed-methods research design, including 26 interviews and a survey with 332 responses, the study identifies 19 emerging solutions regarding quality assurance that practitioners across several product teams at Microsoft are exploring. The findings provide valuable insights that can guide the development and evaluation of LLM-based products more broadly in the face of these challenges.
