SCUBA: Salesforce Computer Use Benchmark

Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ran Xu

TL;DR

SCUBA introduces a realistic Salesforce CRM benchmark to evaluate computer-use agents across admin, sales, and service workflows inside sandbox environments. It combines 300 tasks derived from real user interviews with a rule-based, milestone-centric evaluation, and supports asynchronous parallel testing, knowledge articles, and human demonstrations to boost performance. The study reveals large gaps between open-source and closed-source models and shows that demonstration augmentation can raise task success while reducing time and costs, indicating both the challenges and potential paths for enterprise automation. By providing a multi-dimensional, interpretable framework tailored to enterprise software ecosystems, SCUBA aims to accelerate development of reliable, scalable computer-use agents for complex CRM ecosystems.

Abstract

We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas: platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including enterprise software UI navigation, data manipulation, workflow automation, information retrieval, and troubleshooting. To ensure realism, SCUBA operates in Salesforce sandbox environments with support for parallel execution and fine-grained evaluation metrics that capture milestone progress. We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings. We observe large performance gaps across agent design paradigms and between open-source and closed-source models. In the zero-shot setting, computer-use agents powered by open-source models that perform strongly on related benchmarks such as OSWorld achieve a success rate below 5% on SCUBA, while methods built on closed-source models reach a task success rate of up to 39%. In the demonstration-augmented setting, task success rates improve to 50% while time and costs fall by 13% and 16%, respectively. These findings highlight both the challenges of enterprise task automation and the promise of agentic solutions. By offering a realistic benchmark with interpretable evaluation, SCUBA aims to accelerate progress in building reliable computer-use agents for complex business software ecosystems.
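To make the idea of milestone-based, rule-based evaluation concrete, the following is a minimal sketch and not SCUBA's actual evaluator. The `Milestone` class, the `evaluate` function, the state dictionary, and the field names are all hypothetical, and the scoring rule (fraction of milestones passed, with success only when every milestone passes) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Milestone:
    """Hypothetical milestone: a name plus a rule over the final environment state."""
    name: str
    check: Callable[[Dict], bool]


def evaluate(milestones: List[Milestone], state: Dict) -> Dict:
    """Apply every rule to the state and report per-milestone and overall results.

    Assumed scoring: partial credit is the fraction of milestones passed;
    the task counts as successful only if all milestones pass.
    """
    results = {m.name: bool(m.check(state)) for m in milestones}
    passed = sum(results.values())
    score = passed / len(milestones) if milestones else 0.0
    return {
        "milestones": results,
        "milestone_score": score,
        "success": bool(milestones) and passed == len(milestones),
    }


# Illustrative usage with made-up CRM fields (not taken from the benchmark).
final_state = {"lead_status": "Qualified", "owner": "unassigned"}
report = evaluate(
    [
        Milestone("status_updated", lambda s: s.get("lead_status") == "Qualified"),
        Milestone("owner_assigned", lambda s: s.get("owner") == "alice"),
    ],
    final_state,
)
print(report)
```

Under these assumptions, an agent that completes only the first of two milestones receives partial credit while the task as a whole is still marked unsuccessful, which is the kind of distinction a milestone-level metric is meant to expose.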

Paper Structure

This paper contains 38 sections, 14 figures, 6 tables.

Figures (14)

  • Figure 1: SCUBA tasks, environment, and agent trajectory preview.
  • Figure 2: Sandbox environment of the Salesforce Platform.
  • Figure 3: Left: Difficulty distribution by split. Middle: A sample task configuration file. The light blue section contains basic information on the task; the light green section documents details for initialization; the light orange section highlights the inputs used for rule-based evaluation; the gray section can be used to manipulate the environment before and after the agent run, leaving the freedom to configure different initial states and perform post-processing. Right: A sample evaluation result. The light purple section indicates whether the task is successful. The light yellow section lists detailed milestone scores and rubrics.
  • Figure 4: Average number of atomic actions used by different methods. Within each method, the left corresponds to the zero-shot setting and the right to the demonstration-augmented setting. The gray dashed line is the human average.
  • Figure 5: Left: Cost vs. Success Rate. Right: Time vs. Success Rate. Orange squares and blue circles represent the performance metrics of browser-use agents and computer-use agents, respectively, under the zero-shot setting. Each arrow points to the performance metrics under the demonstration-augmented setting. Arrows pointing to the top-left are desired, since they indicate improvement. The green zone marks low cost/latency and high success rates; the red zone marks the opposite.
  • ...and 9 more figures