Quality Assurance for LLM-RAG Systems: Empirical Insights from Tourism Application Testing
Bestoun S. Ahmed, Ludwig Otto Baader, Firas Bayram, Siri Jagstedt, Peter Magnusson
TL;DR
The paper addresses the quality assurance challenges of LLM-RAG systems in tourism, where non-determinism and reliance on retrieved context can affect reliability. It proposes a comprehensive testing framework with 17 metrics applied to three LLM variants under varied temperature and top-p settings, using a Värmland tourism case study and the Evidently platform for automated evaluation. Key contributions include adapting RAG-focused evaluation methods to a tourism domain, analyzing how architectural choices and parameter configurations influence both functional and extra-functional properties, and providing practical guidelines for production deployment. Findings show that newer LLMs offer modest gains mainly in output length and complexity, while RAG primarily enhances factual/domain accuracy; extreme parameter settings sharply degrade quality, underscoring the need for careful operational boundaries. The work offers actionable QA practices for organizations deploying LLM-RAG in domain-specific contexts and sets a foundation for broader evaluation across architectures and more granular domain metrics.
Abstract
This paper presents a comprehensive framework for testing and evaluating quality characteristics of Large Language Model (LLM) systems enhanced with Retrieval-Augmented Generation (RAG) in tourism applications. Through systematic empirical evaluation of three different LLM variants across multiple parameter configurations, we demonstrate the effectiveness of our testing methodology in assessing both functional correctness and extra-functional properties. Our framework implements 17 distinct metrics that encompass syntactic analysis, semantic evaluation, and behavioral evaluation through LLM judges. The study shows how architectural choices and parameter configurations affect system performance, in particular the impact of the temperature and top-p parameters on response quality. The tests were carried out on a tourism recommendation system for the Värmland region, using both standard and RAG-enhanced configurations. The results indicate that the newer LLM versions show modest improvements in performance metrics, though the differences are more pronounced in response length and complexity than in semantic quality. The research contributes practical insights for implementing robust testing practices in LLM-RAG systems, providing valuable guidance to organizations deploying these architectures in production environments.
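To make the evaluation setup more concrete, the following is a minimal sketch of how a test matrix over model variants, temperature, top-p, and standard versus RAG-enhanced configurations could be enumerated, together with a few simple syntactic metrics. It is not the authors' implementation: the model names, parameter values, and the helper functions `build_test_matrix` and `syntactic_metrics` are hypothetical, and the three metrics shown are illustrative stand-ins rather than the paper's 17 metrics.

```python
from itertools import product
import re

# Hypothetical placeholders: this section does not list the actual
# model names or the temperature / top-p values that were tested.
MODELS = ["model-a", "model-b", "model-c"]
TEMPERATURES = [0.0, 0.7, 1.5]
TOP_PS = [0.1, 0.9, 1.0]
RAG_MODES = [False, True]  # standard vs. RAG-enhanced configuration


def build_test_matrix(models, temperatures, top_ps, rag_modes):
    """Return one configuration dict per combination of settings."""
    return [
        {"model": m, "temperature": t, "top_p": p, "rag": r}
        for m, t, p, r in product(models, temperatures, top_ps, rag_modes)
    ]


def syntactic_metrics(response):
    """Compute simple length-based metrics for a response string."""
    words = response.split()
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    avg_len = len(words) / len(sentences) if sentences else 0.0
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_sentence_length": avg_len,
    }


matrix = build_test_matrix(MODELS, TEMPERATURES, TOP_PS, RAG_MODES)
```

In a real test run, each configuration in `matrix` would be passed to the system under test and the resulting responses scored. Semantic metrics and LLM-judge evaluations would be added alongside the syntactic ones, for example through an evaluation platform such as Evidently.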
