Hybrid LLM Routing for Efficient App Feedback Classification

Yasaman Abedini, Abbas Heydarnoori

TL;DR

This work addresses the problem of scalable, cost-conscious app feedback classification by evaluating the zero-shot performance of diverse LLMs across eight datasets and three platforms, and by introducing a two-tier routing strategy. The method combines lightweight fine-tuned models for straightforward cases with a high-capacity LLM for ambiguous instances, optimizing a formal objective that maximizes predictive performance $\mathcal{P}$ while minimizing cost $\mathcal{C}$. Key contributions include a comprehensive zero-shot analysis of eight datasets with four LLMs under original and coarse-grained schemes, the first hybrid LLM routing design tailored to app feedback, and empirical evidence of substantial cost savings (e.g., 67.8% fewer requests, 66.3% fewer tokens) while preserving 98.4% to 100.4% of zero-shot accuracy. The results demonstrate the practical viability of deploying scalable feedback classification in popular apps and provide replication artifacts to facilitate future research.

Abstract

Large language models (LLMs), pre-trained on massive datasets, have demonstrated strong performance across a wide range of natural language processing (NLP) tasks, including text classification. While prior studies have examined the use of LLMs for predicting the intent of user feedback and reported encouraging results, these investigations remain limited in scope. Furthermore, the vast volume of feedback posted daily, particularly for popular applications, combined with the computational and financial overhead of commercial LLMs, renders large-scale deployment impractical. In contrast, smaller models provide greater efficiency and lower cost but generally at the expense of reduced accuracy. In this paper, we aim to balance accuracy and efficiency in feedback classification. We first present a comprehensive study of zero-shot classification using four widely adopted LLMs, GPT-3.5-Turbo, GPT-4o, Flan-T5, and Llama3-70B, on diverse feedback datasets collected from multiple platforms, including app stores, forums, and X, which are categorized under different schemes. This analysis reveals how classification scheme design and platform characteristics influence the predictive performance of LLMs. Building on these insights, we propose a two-tier routing strategy for scalable app store feedback classification. In this approach, low-complexity instances are processed by lightweight fine-tuned models, while ambiguous cases are routed to high-capacity LLMs for more reliable decisions. Experimental results show that this strategy retains 98.4% to 100.4% of zero-shot LLM accuracy while reducing request and token costs by 67.8% and 66.3%, respectively.
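The two-tier idea described above can be illustrated with a minimal sketch. The abstract does not say how an instance is judged "ambiguous", so the confidence threshold used here is an assumption, not the authors' criterion. The names `route`, `RoutingStats`, `toy_small`, and `toy_large` are hypothetical, and the two toy models are keyword stubs standing in for a fine-tuned classifier and an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases: the small model returns (label, confidence),
# the large model returns only a label.
SmallModel = Callable[[str], Tuple[str, float]]
LargeModel = Callable[[str], str]


@dataclass
class RoutingStats:
    total: int = 0
    escalated: int = 0  # inputs forwarded to the large model

    @property
    def request_savings(self) -> float:
        """Fraction of inputs that never reached the large model."""
        return 0.0 if self.total == 0 else 1.0 - self.escalated / self.total


def route(
    texts: Sequence[str],
    small_model: SmallModel,
    large_model: LargeModel,
    threshold: float = 0.8,  # assumed ambiguity cutoff, not taken from the paper
) -> Tuple[List[str], RoutingStats]:
    """Classify each text with the small model; escalate low-confidence cases."""
    stats = RoutingStats()
    labels: List[str] = []
    for text in texts:
        label, confidence = small_model(text)
        stats.total += 1
        if confidence < threshold:
            # Treated as ambiguous under the assumed rule: ask the large model.
            label = large_model(text)
            stats.escalated += 1
        labels.append(label)
    return labels, stats


# Toy stand-ins so the sketch runs without any model or API.
def toy_small(text: str) -> Tuple[str, float]:
    if "crash" in text:
        return "bug report", 0.95
    if "please add" in text:
        return "feature request", 0.90
    return "other", 0.40  # low confidence triggers escalation


def toy_large(text: str) -> str:
    return "feature request" if "wish" in text else "other"


if __name__ == "__main__":
    feedback = [
        "app crashes on login",
        "please add dark mode",
        "I wish it synced faster",
    ]
    labels, stats = route(feedback, toy_small, toy_large)
    print(labels, stats.escalated, round(stats.request_savings, 3))
```

In this sketch, `request_savings` is the share of inputs the small model handles alone. Read the same way, the reported 67.8% reduction in requests would mean that roughly a third of the feedback reached the large model, though the paper's exact accounting may differ.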

Paper Structure

This paper contains 21 sections, 1 equation, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Sample prompt for zero-shot classification of user feedback in dataset DS2
  • Figure 2: Cross-category misclassification percentages in DS1 and DS5
  • Figure 3: Comparison of LLM results between the original and coarse-grained classification schemes across precision, recall, and F1
  • Figure 4: Proposed two-tier LLM routing strategy for user feedback classification