CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, Chien-Sheng Wu

TL;DR

CRMArena-Pro tackles the lack of realistic, multi-turn CRM benchmarks by extending CRMArena to B2B and B2C settings with 19 tasks across four business skills and an explicit confidentiality-awareness evaluation. It builds a Salesforce-based sandbox with synthetic enterprise data, expert validation, and multi-turn interactions to comprehensively assess LLM agents. Empirical results show modest single-turn efficacy (about 58%) deteriorating in multi-turn settings (about 35%), with Workflow Execution being comparatively tractable and confidentiality awareness remaining near-zero unless prompts are adjusted, often trading off task performance. The work provides a realistic, scalable testbed to guide future improvements in reasoning, data handling, and security-conscious behavior in enterprise LLM agents.

Abstract

While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.

Paper Structure

This paper contains 84 sections, 39 figures, 6 tables.

Figures (39)

  • Figure 1: An overview of CRMArena-Pro's key extensions to the original CRMArena benchmark. These include expanding business scenarios to cover B2B organizations, broadening business applications to include Sales and CPQ, and incorporating multi-turn dialogues and confidentiality-awareness evaluations.
  • Figure 2: Illustration of the CRMArena-Pro environment setup. The data generation pipeline produces realistic synthetic data that is populated into a Salesforce Org, which serves as the sandbox environment. The agent takes user queries and decides between two types of actions: (1) making API calls to the Salesforce Org to fetch relevant data, or (2) responding to the user to seek further clarification or provide answers.
  • Figure 3: An overview of the nineteen distinct tasks categorized into four business skills covering three business scenarios: customer service, sales, and CPQ.
  • Figure 4: Results of the expert studies conducted on the two Orgs we populated. Overall, experts consider our generated data and sandbox environment to be realistic and close to real-world settings.
  • Figure 5: Trade-off between performance and cost for different LLMs. The top-left corner indicates the high-value region, characterized by high performance and low cost.
  • ...and 34 more figures