Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First

Shu Liu, Soujanya Ponnapalli, Shreya Shankar, Sepanta Zeighami, Alan Zhu, Shubham Agarwal, Ruiqi Chen, Samion Suwito, Shuo Yuan, Ion Stoica, Matei Zaharia, Alvin Cheung, Natacha Crooks, Joseph E. Gonzalez, Aditya G. Parameswaran

TL;DR

The paper argues that future data systems must be redesigned to natively support agentic workloads driven by LLM agents. It introduces an agent-first architecture with probes, an in-database interpreter, a probe optimizer, and an agentic memory store to enable high-throughput exploration, grounding, and branching. Case studies show that agentic speculation can improve accuracy and reduce effort through redundancy sharing and grounding hints. The work outlines challenges and opportunities in interface design, query processing, and storage, charting a path toward scalable, steerable data systems for AI-powered decision making.

Abstract

Large Language Model (LLM) agents, acting on their users' behalf to manipulate and analyze data, are likely to become the dominant workload for data systems in the future. When working with data, agents employ a high-throughput process of exploration and solution formulation for the given task, one we call agentic speculation. The sheer volume and inefficiencies of agentic speculation can pose challenges for present-day data systems. We argue that data systems need to adapt to more natively support agentic workloads. We take advantage of the characteristics of agentic speculation that we identify (scale, heterogeneity, redundancy, and steerability) to outline a number of new research opportunities for a new agent-first data systems architecture, ranging from new query interfaces, to new query processing techniques, to new agentic memory stores.

Paper Structure

This paper contains 15 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Results on the BIRD dataset
  • Figure 2: Total vs. unique subexpressions (count and proportion) across 50 attempts generated by GPT-4o-mini per problem, aggregated over the full BIRD dataset. Here, PR=Projection, TS=Scan, FI=Filter, HJ=Hash Join, UA=Aggregate, OT=other operations.
  • Figure 3: Labeled agent activities, with the x-axis showing normalized position in the trace and each row (activity) normalized independently. Agents first explore tables and columns, then formulate queries, with phases often overlapping.
  • Figure 4: Agent-First Data Systems Architecture; components that are dashed involve LLM agents. Boxes in pink are covered in the interface section; blue in the query optimization section; orange in the storage section.