Validity Is What You Need

Sebastian Benthall, Andrew Clark

TL;DR

The paper reframes Agentic AI as a SaaS-like service deployed in complex enterprise settings, arguing that practical validity hinges on application-level validation rather than solely on foundation-model capabilities, and offers a realist definition and design framework. It introduces a multi-stage design and governance process grounded in mechanism design, addressing three information-theoretic validation challenges: deployment-specific information gaps, designer knowledge gaps, and stakeholder confidence, with an emphasis on end-to-end validation and guardrails. The authors contend that strong validation can reduce reliance on large foundation models by favoring smaller, interpretable components or expert systems, formalized by goal-directed decision making akin to the Bellman framework $V(x) = \max_{a \in \Gamma(x)} \{ F(x,a) + \beta V(T(x,a)) \}$. Practically, the paper outlines concrete steps for enterprise validation—from modeling the sociotechnical context to continuous monitoring—arguing that application evaluations, not just model capabilities, will determine the maturity and value of Agentic AI in real-world use cases.
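To make the Bellman recursion above concrete, here is a minimal value-iteration sketch. It reads the symbols in the usual dynamic-programming way, which is an assumption because the summary does not define them: $x$ is a state, $\Gamma(x)$ the set of feasible actions, $F(x,a)$ the one-step payoff, $T(x,a)$ a deterministic transition, and $\beta$ a discount factor. The toy five-state "walk to the goal" problem and every function name are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, Iterable, List

State = int
Action = int


def value_iteration(
    states: List[State],
    feasible: Callable[[State], Iterable[Action]],  # Gamma(x)
    reward: Callable[[State, Action], float],       # F(x, a)
    transition: Callable[[State, Action], State],   # T(x, a), deterministic
    beta: float,
    tol: float = 1e-10,
    max_iter: int = 10_000,
) -> Dict[State, float]:
    """Iterate V(x) <- max_a { F(x,a) + beta * V(T(x,a)) } until it stops changing."""
    V = {x: 0.0 for x in states}
    for _ in range(max_iter):
        V_new = {
            x: max(reward(x, a) + beta * V[transition(x, a)] for a in feasible(x))
            for x in states
        }
        # Stop once the largest update across states falls below the tolerance.
        if max(abs(V_new[x] - V[x]) for x in states) < tol:
            return V_new
        V = V_new
    return V


def greedy_policy(
    states: List[State],
    feasible: Callable[[State], Iterable[Action]],
    reward: Callable[[State, Action], float],
    transition: Callable[[State, Action], State],
    beta: float,
    V: Dict[State, float],
) -> Dict[State, Action]:
    """Pick, in each state, the action that attains the maximum in the recursion."""
    return {
        x: max(feasible(x), key=lambda a: reward(x, a) + beta * V[transition(x, a)])
        for x in states
    }


# Hypothetical toy problem: positions 0..4 on a line, goal at position 4.
GOAL = 4
STATES = list(range(GOAL + 1))
BETA = 0.9


def feasible(x: State) -> List[Action]:
    # Step left, stay, or step right, without leaving the line.
    return [a for a in (-1, 0, 1) if 0 <= x + a <= GOAL]


def reward(x: State, a: Action) -> float:
    # Payoff of 1 whenever the action lands on (or keeps us at) the goal.
    return 1.0 if x + a == GOAL else 0.0


def transition(x: State, a: Action) -> State:
    return x + a


V = value_iteration(STATES, feasible, reward, transition, BETA)
policy = greedy_policy(STATES, feasible, reward, transition, BETA, V)

for x in STATES:
    print(f"state {x}: V = {V[x]:.4f}, action = {policy[x]:+d}")
```

In this toy case the goal state is worth about $1/(1-\beta) = 10$, and the greedy policy steps right until it reaches the goal and then stays there. The whole decision rule is a five-entry table that a stakeholder can inspect directly, which is the kind of small, interpretable component the summary says strong validation can favor over a large foundation model.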

Abstract

While AI agents have long been discussed and studied in computer science, today's Agentic AI systems are something new. We consider other definitions of Agentic AI and propose a new realist definition. Agentic AI is a software delivery mechanism, comparable to software as a service (SaaS), which puts an application to work autonomously in a complex enterprise setting. Recent advances in large language models (LLMs) as foundation models have driven excitement in Agentic AI. We note, however, that Agentic AI systems are primarily applications, not foundations, and so their success depends on validation by end users and principal stakeholders. The tools and techniques needed by the principal users to validate their applications are quite different from the tools and techniques used to evaluate foundation models. Ironically, with good validation measures in place, in many cases the foundation models can be replaced with much simpler, faster, and more interpretable models that handle core logic. When it comes to Agentic AI, validity is what you need. LLMs are one option that might achieve it.

Paper Structure

This paper contains 14 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: The supply chain of Agentic and other AI applications that use an underlying foundation model like an LLM. There are three sources of data for the AI application: pretraining data for the LLM, finetuning (FT) data for a finetuned model, and user data. What matters for the user is the value of the total application. The question facing the AI application designer is what information, and which data sources, are useful for delivering value in the enterprise context. Since dependence on a large foundation model can involve additional expenses (such as inference costs on large models, or licensing fees on training data, which are passed through to the downstream users), there is an incentive for the Agentic AI designer to minimize dependence on larger models.