Automating the Enterprise with Foundation Models

Michael Wornow, Avanika Narayan, Krista Opsahl-Ong, Quinn McIntyre, Nigam H. Shah, Christopher Ré

TL;DR

Automating enterprise workflows remains challenging because traditional RPA carries high setup costs, brittle rule-based execution, and ongoing maintenance. The authors propose ECLAIR, a multimodal foundation-model framework that learns from video demonstrations and standard operating procedures (SOPs) to Demonstrate, Execute, and Validate GUI-based workflows with minimal human supervision. Their case studies and WebArena-based evaluations show strong per-step understanding (about 93% step accuracy) and meaningful end-to-end performance (up to about 92% with guidance, about 40% without), along with substantive self-monitoring capabilities, though grounding to precise GUI elements and low-level validation require further work. If refined, ECLAIR could enable end-to-end automation of tacit, decision-rich workflows, potentially unlocking large productivity gains in enterprise settings.

Abstract

Automating enterprise workflows could unlock $4 trillion/year in productivity gains. Despite being of interest to the data management community for decades, the ultimate vision of end-to-end workflow automation has remained elusive. Current solutions rely on process mining and robotic process automation (RPA), in which a bot is hard-coded to follow a set of predefined rules for completing a workflow. Through case studies of a hospital and large B2B enterprise, we find that the adoption of RPA has been inhibited by high set-up costs (12-18 months), unreliable execution (60% initial accuracy), and burdensome maintenance (requiring multiple FTEs). Multimodal foundation models (FMs) such as GPT-4 offer a promising new approach for end-to-end workflow automation given their generalized reasoning and planning abilities. To study these capabilities we propose ECLAIR, a system to automate enterprise workflows with minimal human supervision. We conduct initial experiments showing that multimodal FMs can address the limitations of traditional RPA with (1) near-human-level understanding of workflows (93% accuracy on a workflow understanding task) and (2) instant set-up with minimal technical barrier (based solely on a natural language description of a workflow, ECLAIR achieves end-to-end completion rates of 40%). We identify human-AI collaboration, validation, and self-improvement as open challenges, and suggest ways they can be solved with data management techniques. Code is available at: https://github.com/HazyResearch/eclair-agents

Paper Structure

This paper contains 18 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Differences between ECLAIR and traditional RPA. ECLAIR uses FMs to learn expertise via video demonstrations (left), navigate GUIs given written documentation (center), and audit completed workflows (right).
  • Figure 2: ECLAIR can automate entirely new categories of workflows, such as those that contain hard-to-describe steps, require complex decision making, or are knowledge intensive. Listed examples are real-world hospital workflows (see Section 3.1).