CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments

Kung-Hsiang Huang, Akshara Prabhakar, Sidharth Dhawan, Yixin Mao, Huan Wang, Silvio Savarese, Caiming Xiong, Philippe Laban, Chien-Sheng Wu

TL;DR

This paper presents CRMArena, a realistic CRM agent benchmark grounded in the Salesforce Service Cloud schema that evaluates LLM-based agents on nine real-world CRM tasks across three roles. It details a data-generation pipeline with latent variables, a sandbox Org, and expert validation, plus a suite of tools (SOQL/SOSL and 27 Python wrappers) that supports evaluation under multiple agent frameworks (Act, ReAct, Function Calling). The study finds that state-of-the-art LLMs perform modestly (e.g., ~57-64% across settings). Stronger models benefit from manually crafted tools and function calls, while weaker models sometimes perform worse with them, which highlights the need for improved function-calling and policy-following. CRMArena thus serves as an open, expert-validated challenge that can drive progress toward deploying reliable AI agents in realistic CRM environments.

Abstract

Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personalized service. However, deploying and evaluating these agents is challenging due to the lack of realistic benchmarks that reflect the complexity of real-world CRM tasks. To address this issue, we introduce CRMArena, a novel benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. Following guidance from CRM experts and industry best practices, we designed CRMArena with nine customer service tasks distributed across three personas: service agent, analyst, and manager. The benchmark includes 16 commonly used industrial objects (e.g., account, order, knowledge article, case) with high interconnectivity, along with latent variables (e.g., complaint habits, policy violations) to simulate realistic data distributions. Experimental results reveal that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities. Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments. CRMArena is an open challenge to the community: systems that can reliably complete tasks showcase direct business value in a popular work environment.
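To make the role of latent variables more concrete, the following is a minimal Python sketch, not the paper's actual pipeline. It assumes a hidden per-agent skill set that is never written into the generated records but decides whether a case stays with the agent who first receives it or is transferred. The names `Agent`, `Case`, and `generate_cases`, and the issue types used below, are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    skills: frozenset  # latent variable: drives generation, not stored on the case


@dataclass
class Case:
    case_id: int
    issue_type: str
    owner: str
    transferred: bool


def generate_cases(agents, issue_types, n_cases, seed=0):
    """Generate cases whose ownership depends on each agent's hidden skills."""
    rng = random.Random(seed)  # seeded for reproducible data
    cases = []
    for case_id in range(n_cases):
        issue = rng.choice(issue_types)
        first = rng.choice(agents)  # agent who initially receives the case
        if issue in first.skills:
            owner, transferred = first, False
        else:
            # Hand the case to an agent who has the required skill, if any.
            capable = [a for a in agents if issue in a.skills]
            owner = rng.choice(capable) if capable else first
            transferred = bool(capable)
        cases.append(Case(case_id, issue, owner.name, transferred))
    return cases


if __name__ == "__main__":
    agents = [
        Agent("ana", frozenset({"billing"})),
        Agent("bo", frozenset({"shipping", "returns"})),
    ]
    for case in generate_cases(agents, ["billing", "shipping", "returns"], 5):
        print(case)
```

Because the skill sets shape the records without appearing in them, an evaluated agent would have to infer such patterns from the data itself (for example, from transfer histories), which is one way latent variables can yield non-uniform, more realistic distributions.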

Paper Structure

This paper contains 57 sections, 11 figures, 15 tables.

Figures (11)

  • Figure 1: An overview of the contribution of this work. We begin by generating realistic CRM data based on the Salesforce Service Cloud schema, ensuring both quality and diversity through rigorous verification processes. This verified data is then stored locally and uploaded to a Salesforce organization (Org). An expert study, conducted with domain experts, validated the data's realism. Using this Org as a sandbox environment, we create query instances and benchmark various LLMs across different agentic frameworks.
  • Figure 2: An overview of the nine distinct tasks introduced in CRMArena.
  • Figure 3: Examples of latent variables (gray) influencing object (blue) generation. (a) The ShoppingHabit variable affects when and what type of products a customer buys. (b) The Skill variable determines if an agent can handle a case or needs to transfer it.
  • Figure 4: Expert study results. The plots illustrate domain experts' realism ratings for the overall Org structure (top) and individual objects we generated (bottom).
  • Figure 5: The consistency of gpt-4o using different agentic frameworks.
  • ...and 6 more figures