
Self-Improving Customer Review Response Generation Based on LLMs

Guy Azov, Tatiana Pelc, Adi Fledel Alon, Gila Kamhi

TL;DR

The paper introduces SCRABLE, a self-improving, LLM-driven system for automatic customer-review response generation that uses retrieval-augmented generation (RAG) to ground replies in a knowledge base. An LLM-as-a-Judge component scores responses on relevancy, specificity, accuracy, and grammar, and its feedback guides iterative prompt optimization. The approach combines offline and online RAG with category-specific prompts, achieving an improvement of more than $8.5\%$ over the baseline and stronger alignment with human judgments ($3$–$5$-fold) than prior self-improvement methods. Empirical results on real-world data, plus human evaluation and a validation set of new reviews, demonstrate SCRABLE’s effectiveness and practical applicability, with guidance on data refresh and retraining for long-term performance.

Abstract

Previous studies have demonstrated that proactive interaction with user reviews has a positive impact on the perception of app users and encourages them to submit revised ratings. Nevertheless, developers struggle to manage a high volume of reviews, particularly for popular apps that receive a substantial influx of reviews every day. Consequently, there is a demand for automated solutions that streamline the process of responding to user reviews. To address this, we have developed a new system that generates automatic responses by leveraging user-contributed documents with the help of retrieval-augmented generation (RAG) and advanced Large Language Models (LLMs). Our solution, named SCRABLE, is an adaptive customer review response automation that improves itself through self-optimizing prompts and an LLM-based judging mechanism. Additionally, we introduce an automatic scoring mechanism that mimics the role of a human evaluator in assessing the quality of responses generated in customer review domains. Extensive experiments and analyses conducted on real-world datasets show that our method is effective in producing high-quality responses, yielding an improvement of more than 8.5% compared to the baseline. Further validation through manual examination of the generated responses underscores the efficacy of our proposed system.

Paper Structure

This paper contains 24 sections, 2 equations, 4 figures, 6 tables, 2 algorithms.

Figures (4)

  • Figure 1: Prompt optimization flow driven by feedback from the LLM-as-a-Judge utility
  • Figure 2: ScoredResponseGen: Given the reviews of interest, a prompt, and optionally the corresponding expert responses, the LLM predicts responses via the ResponseGen utility. The Judge utility scores the quality of each response and suggests improvements. The feedback output is a list of (score, suggestions) pairs, one per review.
  • Figure 3: Our Retrieval Augmented Generation Pipeline
  • Figure 4: Iterative Self-Improving Response Generation Step
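
The ScoredResponseGen step described in the Figure 2 caption can be sketched roughly as below. This is only an illustration under stated assumptions: the function name `scored_response_gen`, the `ResponseGen` and `Judge` call signatures, and the toy stand-ins `toy_response_gen` and `toy_judge` are all hypothetical and are not taken from the paper. A real system would call an LLM in both places.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical signatures (assumptions, not the paper's API):
#   ResponseGen: (prompt, review) -> generated response
#   Judge: (review, response, optional expert response) -> (score, suggestions)
ResponseGen = Callable[[str, str], str]
Judge = Callable[[str, str, Optional[str]], Tuple[float, str]]


def scored_response_gen(
    reviews: Sequence[str],
    prompt: str,
    response_gen: ResponseGen,
    judge: Judge,
    expert_responses: Optional[Sequence[str]] = None,
) -> List[Tuple[float, str]]:
    """Generate a response per review and return one (score, suggestions) pair each."""
    if expert_responses is not None and len(expert_responses) != len(reviews):
        raise ValueError("expert_responses must align one-to-one with reviews")

    feedback: List[Tuple[float, str]] = []
    for i, review in enumerate(reviews):
        # Expert responses are optional; pass None to the judge when absent.
        expert = expert_responses[i] if expert_responses is not None else None
        response = response_gen(prompt, review)
        feedback.append(judge(review, response, expert))
    return feedback


def toy_response_gen(prompt: str, review: str) -> str:
    """Toy stand-in for an LLM call: echoes the review after the prompt."""
    return f"{prompt} Thanks for your feedback: {review}"


def toy_judge(review: str, response: str, expert: Optional[str]) -> Tuple[float, str]:
    """Toy stand-in for an LLM judge: checks that the review is referenced."""
    score = 1.0 if review in response else 0.0
    suggestions = "" if expert is None else f"Compare with expert reply: {expert}"
    return score, suggestions


if __name__ == "__main__":
    pairs = scored_response_gen(
        ["App crashes on login"], "Be polite.", toy_response_gen, toy_judge
    )
    print(pairs)
```

Keeping the generator and the judge as injected callables is a design choice of this sketch: it lets the same loop be reused with different prompts across optimization iterations, which is the role the feedback list plays in the flow the captions describe.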