OMuleT: Orchestrating Multiple Tools for Practicable Conversational Recommendation

Se-eun Yoon, Xiaokai Wei, Yexi Jiang, Rachit Pareek, Frank Ong, Kevin Gao, Julian McAuley, Michelle Gong

TL;DR

OMuleT targets practicable conversational recommender systems (CRS) by equipping LLMs with over 10 tools to process real user requests and return diverse, relevant recommendations. The authors build a dataset of real user requests, convert each raw request into a formatted-intent representation, run a handcrafted tool-execution policy to augment the LLM with tool outputs, and then generate final recommendations that link to Roblox data. They evaluate two LLMs (LLaMA-405B and GPT-4o) on eight metrics covering relevance, novelty, coverage, and factuality, and include ablation studies showing the necessity of the full toolbox and of the policy choices. Deployment insights from an internal alpha release cover safety, latency, and scalability for a production CRS.

Abstract

In this paper, we present a systematic effort to design, evaluate, and implement a realistic conversational recommender system (CRS). The objective of our system is to allow users to input free-form text to request recommendations, and then receive a list of relevant and diverse items. While previous work on synthetic queries augments large language models (LLMs) with 1-3 tools, we argue that a more extensive toolbox is necessary to effectively handle real user requests. As such, we propose a novel approach that equips LLMs with over 10 tools, providing them access to the internal knowledge base and API calls used in production. We evaluate our model on a dataset of real users and show that it generates more relevant, novel, and diverse recommendations than vanilla LLMs. Furthermore, we conduct ablation studies to demonstrate the effectiveness of using the full range of tools in our toolbox. We share our designs and lessons learned from deploying the system for internal alpha release. Our contribution is to address all four key aspects of a practicable CRS: (1) real user requests, (2) augmenting LLMs with a wide variety of tools, (3) extensive evaluation, and (4) deployment insights.
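
To make the flow from formatted intent to tool execution to final recommendations concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Intent` fields, the two tools (`genre_lookup`, `similar_items`), the toy `CATALOG`, and the policy rules are all hypothetical, and the LLM steps (parsing the raw request into an intent and writing the final recommendations) are replaced by a hand-built intent and a simple truncation.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """Hypothetical formatted intent: structured fields parsed from a raw request."""
    genres: list = field(default_factory=list)
    similar_to: list = field(default_factory=list)


# Toy catalog standing in for an internal knowledge base (invented data).
CATALOG = {
    "Obby Rush": {"genre": "obby"},
    "Tower Climb": {"genre": "obby"},
    "Pet World": {"genre": "simulator"},
    "Farm Life": {"genre": "simulator"},
}


def genre_lookup(genre):
    """Hypothetical tool: return catalog items tagged with the given genre."""
    return [name for name, meta in CATALOG.items() if meta["genre"] == genre]


def similar_items(name):
    """Hypothetical tool: return other items sharing the genre of `name`."""
    if name not in CATALOG:
        return []
    genre = CATALOG[name]["genre"]
    return [n for n, m in CATALOG.items() if m["genre"] == genre and n != name]


def run_policy(intent):
    """Handcrafted policy: fixed rules map each intent field to a tool call."""
    candidates = []
    for genre in intent.genres:
        candidates.extend(genre_lookup(genre))
    for name in intent.similar_to:
        candidates.extend(similar_items(name))
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(candidates))


def recommend(intent, k=5):
    """Stand-in for the final generation step: keep the first k candidates."""
    return run_policy(intent)[:k]
```

For example, `recommend(Intent(genres=["simulator"], similar_to=["Obby Rush"]))` collects the simulator items first and then the items similar to "Obby Rush". Keeping the policy as plain rules, rather than letting the model choose tools freely, makes the set of tool calls for a given intent predictable, which is one plausible reason to prefer a handcrafted policy.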

Paper Structure

This paper contains 27 sections, 5 figures, 2 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Examples of recommendation requests from users.
  • Figure 2: Our dataset collection process.
  • Figure 3: Overview of OMuleT. Orange boxes are in the user interface (a user inputs a raw request and observes recommended items); blue boxes are where LLMs are used; green boxes are where tools are used.
  • Figure 4: Ablation study. Recommendations are more relevant when more tools are used. One exception is removing the search tool for GPT-4o: relevance increases, but at a relatively large cost to both novelty ($1-\text{Pop50}$) and diversity (Entropy). The results shown are for $k=5$; we observe similar trends for other values of $k$.
  • Figure 5: Screenshot of the deployed UI. We add a simple greeting and explanations for a more natural conversation, and thumbs-up and thumbs-down buttons for obtaining feedback.
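
The Figure 4 caption names two metrics, 1−Pop50 for novelty and Entropy for diversity, without defining them here. The Python sketch below shows one plausible reading, which is an assumption rather than the paper's stated definition: Pop50 is taken to be the share of recommended items that fall among the 50 most popular items, and Entropy is the Shannon entropy (in bits) of how often each item is recommended across all requests. The function names and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def novelty_one_minus_pop50(rec_lists, popular_items):
    """1 - (share of recommended items that are in the popular set).

    Assumption: `popular_items` holds the 50 most popular items.
    """
    all_recs = [item for rec_list in rec_lists for item in rec_list]
    if not all_recs:
        return 0.0
    popular = set(popular_items)
    hits = sum(1 for item in all_recs if item in popular)
    return 1.0 - hits / len(all_recs)


def recommendation_entropy(rec_lists):
    """Shannon entropy (bits) of item frequencies over all recommendation lists."""
    counts = Counter(item for rec_list in rec_lists for item in rec_list)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy usage: two requests, each with a top-2 recommendation list.
rec_lists = [["a", "b"], ["a", "c"]]
novelty = novelty_one_minus_pop50(rec_lists, popular_items=["a"])
diversity = recommendation_entropy(rec_lists)
```

Under this reading, a system that keeps recommending the same few popular items scores low on both metrics, which matches the trade-off the caption describes when the search tool is removed.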