CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments

Kung-Hsiang Huang; Akshara Prabhakar; Sidharth Dhawan; Yixin Mao; Huan Wang; Silvio Savarese; Caiming Xiong; Philippe Laban; Chien-Sheng Wu

TL;DR

This paper presents CRMArena, a CRM agent benchmark grounded in the Salesforce Service Cloud schema that evaluates LLM-based agents on nine real-world CRM tasks across three roles. It details a data-generation pipeline with latent variables, a sandbox Org, and expert validation, plus a suite of tools (SOQL/SOSL and 27 Python wrappers) that supports evaluation under multiple agent frameworks (Act, ReAct, Function Calling). The study finds that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting and less than 55% even with function calling. Stronger models benefit from manually crafted tools and function calls, whereas weaker models sometimes deteriorate with such tools, highlighting the need for improved function calling and policy following. CRMArena thus serves as an open, expert-validated challenge that can drive progress toward deploying reliable AI agents in realistic CRM environments.

Abstract

Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personalized service. However, deploying and evaluating these agents is challenging due to the lack of realistic benchmarks that reflect the complexity of real-world CRM tasks. To address this issue, we introduce CRMArena, a novel benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. Following guidance from CRM experts and industry best practices, we designed CRMArena with nine customer service tasks distributed across three personas: service agent, analyst, and manager. The benchmark includes 16 commonly used industrial objects (e.g., account, order, knowledge article, case) with high interconnectivity, along with latent variables (e.g., complaint habits, policy violations) to simulate realistic data distributions. Experimental results reveal that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities. Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments. CRMArena is an open challenge to the community: systems that can reliably complete tasks showcase direct business value in a popular work environment.
