GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications

Shishir G. Patil, Tianjun Zhang, Vivian Fang, Noppapon C., Roy Huang, Aaron Hao, Martin Casado, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica

TL;DR

The paper addresses the risk and reliability gaps in autonomous LLM-powered systems that act on real-world services. It proposes GoEx, a runtime that enables post-facto validation through undo mechanisms and damage confinement, complemented by symbolic credentials, sandboxing, and policy-driven access control. The approach combines RESTful API, database, and filesystem action abstractions with configurable reversibility and testing strategies to bound risk while preserving utility. This work lays a foundation for safe deployment of LLM agents in microservices and enterprise contexts, and it invites further research on API design, traceability, and defenses against LLM-induced errors.

Abstract

Large Language Models (LLMs) are evolving beyond their classical role of providing information within dialogue systems to actively engaging with tools and performing actions on real-world applications and services. Today, humans verify the correctness and appropriateness of the LLM-generated outputs (e.g., code, functions, or actions) before putting them into real-world execution. This poses significant challenges as code comprehension is well known to be notoriously difficult. In this paper, we study how humans can efficiently collaborate with, delegate to, and supervise autonomous LLMs in the future. We argue that in many cases, "post-facto validation" - verifying the correctness of a proposed action after seeing the output - is much easier than the aforementioned "pre-facto validation" setting. The core concept behind enabling a post-facto validation system is the integration of an intuitive undo feature, and establishing a damage confinement for the LLM-generated actions as effective strategies to mitigate the associated risks. Using this, a human can now either revert the effect of an LLM-generated output or be confident that the potential risk is bounded. We believe this is critical to unlock the potential for LLM agents to interact with applications and services with limited (post-facto) human involvement. We describe the design and implementation of our open-source runtime for executing LLM actions, Gorilla Execution Engine (GoEX), and present open research questions towards realizing the goal of LLMs and applications interacting with each other with minimal human supervision. We release GoEX at https://github.com/ShishirPatil/gorilla/.

Paper Structure

This paper contains 48 sections and 4 figures.

Figures (4)

  • Figure 1: Evolution of LLM-powered applications and services from chatbots, to decision-making agents that can interact with applications and services under human supervision, to autonomous LLM agents interacting with LLM-powered apps and services with minimal and punctuated human supervision.
  • Figure 2: GoEx's runtime for executing RESTful API calls. Upon receiving the user's prompt, GoEx presents two alternatives. First, an LLM can be prompted to come up with the (Action, Undo-Action) pair. Second, the application developer can provide tuples of actions and their corresponding undo-actions (function calls) from which the LLM can pick.
  • Figure 3: Runtime for executing actions on a database. We present two techniques to determine if a proposed action can be undone. On the left, for non-transactional databases like MongoDB, and for flexibility, we prompt the LLM to generate (Action, Undo-Action, test-bed) tuples, which we then evaluate in an isolated container to catch any false (Action, Undo-Action) pairs. On the right, we can provide a deterministic undo with guarantees by employing the transaction semantics of databases.
  • Figure 4: Runtime for executing actions on a filesystem. GoEx presents two abstractions. On the left, the LLM is prompted to come up with an (Action, Undo-Action, test-bed) tuple, which GoEx evaluates in an isolated container to catch any false (Action, Undo-Action) pairs. On the right, GoEx provides deterministic guarantees by using a version control system such as Git or Git LFS.
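
The deterministic path in the Figure 3 caption (undo via database transaction semantics) can be illustrated with a minimal sketch. This is not GoEx's implementation: the function name execute_with_undo, the approve callback standing in for the human's post-facto check, and the users table are all hypothetical, and SQLite is used only because it ships with Python. The idea shown is that the LLM-proposed statement runs inside an open transaction, the reviewer inspects the resulting state, and the runtime commits or rolls back accordingly.

```python
import sqlite3
from typing import Callable, Sequence


def execute_with_undo(
    conn: sqlite3.Connection,
    action_sql: str,
    params: Sequence,
    approve: Callable[[sqlite3.Connection], bool],
) -> bool:
    """Run one statement in a transaction; keep it only if `approve` accepts.

    `approve` is a hypothetical stand-in for post-facto validation: it sees
    the database state *after* the action and decides whether to keep it.
    Returns True if the action was committed, False if it was rolled back.
    """
    conn.execute("BEGIN")
    try:
        conn.execute(action_sql, params)
        if approve(conn):
            conn.execute("COMMIT")
            return True
        conn.execute("ROLLBACK")  # the "undo": discard the action's effects
        return False
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise


def count_users(c: sqlite3.Connection) -> int:
    return c.execute("SELECT COUNT(*) FROM users").fetchone()[0]


# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN/COMMIT/ROLLBACK statements above control the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("bob",)])

# An over-broad action: the reviewer rejects any result that empties the table.
rejected = execute_with_undo(
    conn, "DELETE FROM users", (), lambda c: count_users(c) > 0
)
after_reject = count_users(conn)

# A narrower action passes the same check and is committed.
accepted = execute_with_undo(
    conn, "DELETE FROM users WHERE name = ?", ("bob",), lambda c: count_users(c) > 0
)
after_accept = count_users(conn)
```

In this sketch the reviewer never has to read the SQL itself; the decision is made on the observed outcome, which is the post-facto setting the abstract argues for. The non-transactional path in the caption (LLM-generated undo actions checked in an isolated container) would need a different mechanism and is not covered here.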