Safe, Untrusted, "Proof-Carrying" AI Agents: toward the agentic lakehouse
Jacopo Tagliabue, Ciro Greco
TL;DR
This work addresses how to make agent-driven automation safe and trustworthy on data lakehouses by proposing API-first, programmable abstractions that expose the entire data lifecycle. It argues that declarative DAGs, Git-for-Data-style branching, and code-as-interface support reproducibility, observability, and safety, even in the presence of untrusted agents. A proof-of-concept demonstrates self-repair of production pipelines using Bauplan, MCP, and a verifier, showing that untrusted AI agents can operate without compromising production. The study outlines a path toward a fully agentic lakehouse and identifies open challenges such as scalability and parallelism in OLAP contexts.
Abstract
Data lakehouses run sensitive workloads, where AI-driven automation raises concerns about trust, correctness, and governance. We argue that API-first, programmable lakehouses provide the right abstractions for safe-by-design, agentic workflows. Using Bauplan as a case study, we show how data branching and declarative environments extend naturally to agents, enabling reproducibility and observability while reducing the attack surface. We present a proof-of-concept in which agents repair data pipelines using correctness checks inspired by proof-carrying code. Our prototype demonstrates that untrusted AI agents can operate safely on production data and outlines a path toward a fully agentic lakehouse.
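The branch, verify, then merge idea behind the proof-of-concept can be illustrated with a minimal sketch. It is not Bauplan's API and not the authors' verifier: `ToyLakehouse`, `run_untrusted`, and the example check are hypothetical names built on an in-memory dictionary, under the assumption that an agent's change reaches `main` only if every correctness check passes on an isolated branch.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Table = List[dict]                      # a table is a list of row dicts
Tables = Dict[str, Table]               # table name -> rows
Check = Callable[[Tables], bool]        # a correctness check over a branch


@dataclass
class ToyLakehouse:
    """Hypothetical in-memory stand-in for a lakehouse with data branches."""
    branches: Dict[str, Tables] = field(default_factory=lambda: {"main": {}})

    def create_branch(self, name: str, from_ref: str = "main") -> None:
        # Deep copy, so writes on the branch never touch the source branch.
        self.branches[name] = copy.deepcopy(self.branches[from_ref])

    def merge(self, name: str, into: str = "main") -> None:
        # Simplification: a merge replaces the target with the branch state.
        self.branches[into] = copy.deepcopy(self.branches[name])

    def delete_branch(self, name: str) -> None:
        self.branches.pop(name, None)


def run_untrusted(lake: ToyLakehouse,
                  agent_fix: Callable[[Tables], None],
                  checks: List[Check],
                  branch: str = "agent-fix") -> bool:
    """Run an untrusted fix on a branch; merge only if all checks pass."""
    lake.create_branch(branch)
    try:
        agent_fix(lake.branches[branch])          # agent only sees the branch
        if all(check(lake.branches[branch]) for check in checks):
            lake.merge(branch)                    # verified: promote to main
            return True
        return False                              # rejected: main untouched
    except Exception:
        return False                              # a crashing agent is rejected too
    finally:
        lake.delete_branch(branch)                # always clean up the branch


# Example: main holds a table with a bad row that the agent must repair.
lake = ToyLakehouse()
lake.branches["main"]["orders"] = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
]


def no_null_amounts(tables: Tables) -> bool:
    return all(row["amount"] is not None for row in tables["orders"])


def bad_fix(tables: Tables) -> None:
    tables["orders"] = [{"id": 3, "amount": None}]   # still violates the check


def good_fix(tables: Tables) -> None:
    tables["orders"] = [r for r in tables["orders"] if r["amount"] is not None]


rejected = run_untrusted(lake, bad_fix, [no_null_amounts])
main_after_bad = copy.deepcopy(lake.branches["main"])
accepted = run_untrusted(lake, good_fix, [no_null_amounts])
```

In this sketch the agent never holds a reference to `main`, so a failed or malicious fix leaves production data unchanged. That is the property the abstract attributes to data branching combined with correctness checks.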
