Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

Axel Backlund, Lukas Petersson

TL;DR

Vending-Bench presents a long-horizon benchmark that tests LLM-driven agents on a simplified but extended business task: operating a vending machine. The authors design a modular environment with supplier communication and customer demand, augmented by a sub-agent for physical actions, and evaluate multiple LLMs across thousands of interactions to assess sustained coherence. Key findings show substantial performance variance across runs and models, with some models achieving high mean net worth but frequent failures; memory length and context window fullness do not fully account for degradation. The work contributes an open, scalable testbed for long-term AI safety and capability assessment, illustrating both the potential and pitfalls of autonomous LLM-based agents in persistent tasks.

Abstract

While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed specifically to test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees. Each of these tasks is simple, but collectively, over long horizons (>20M tokens per run), they stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover. We find no clear correlation between failures and the point at which the model's context window becomes full, suggesting that these breakdowns do not stem from memory limits. Apart from highlighting the high variance in performance over long time horizons, Vending-Bench also tests models' ability to acquire capital, a necessity in many hypothetical dangerous AI scenarios. We hope the benchmark can help in preparing for the advent of stronger AI systems.
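
To make the task described in the abstract concrete, the following is a minimal sketch of one simulated day: sales are capped by stock, revenue is added to cash, and a daily fee is charged. It is not the benchmark's implementation. The names `VendingState`, `step_day`, and `net_worth`, all numeric values, and the definition of net worth as cash plus unsold stock valued at unit cost are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VendingState:
    """Hypothetical state of a simulated vending business (illustrative only)."""
    cash: float
    daily_fee: float
    inventory: dict = field(default_factory=dict)   # item -> units in stock
    prices: dict = field(default_factory=dict)      # item -> sale price per unit
    unit_costs: dict = field(default_factory=dict)  # item -> purchase cost per unit


def step_day(state, demand):
    """Advance one day: sell up to demand per item, then charge the daily fee.

    Returns the number of units sold per item.
    """
    sold = {}
    for item, wanted in demand.items():
        in_stock = state.inventory.get(item, 0)
        units = min(wanted, in_stock)  # cannot sell more than is stocked
        state.inventory[item] = in_stock - units
        state.cash += units * state.prices.get(item, 0.0)
        sold[item] = units
    state.cash -= state.daily_fee  # the fee is charged whether or not anything sold
    return sold


def net_worth(state):
    """Assumed definition: cash plus unsold stock valued at purchase cost."""
    stock_value = sum(
        units * state.unit_costs.get(item, 0.0)
        for item, units in state.inventory.items()
    )
    return state.cash + stock_value
```

Under these assumptions, an agent that stops restocking still pays the fee every day, so its net worth declines steadily. This is the kind of slow failure a long-horizon evaluation is meant to surface.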

Paper Structure

This paper contains 23 sections, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Overview of Vending-Bench
  • Figure 2: Setup of supplier communication
  • Figure 3: Mean scores over simulation days for primary models, with $\pm$ 1 standard deviation of the daily score of the five samples indicated as a shaded area centered around the mean
  • Figure 4: Mean tool use of primary models per run, with confidence intervals as $\pm$ 1 standard deviation of the five samples
  • Figure 5: Key metrics for o3-mini and Claude 3.5 Sonnet, with individual runs marked as gray lines, and mean as dashed black line
  • ...and 8 more figures
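
The captions of Figures 3 and 4 describe a mean across five samples with a $\pm$ 1 standard deviation band. A minimal sketch of how such a per-day band could be computed is given below. The function name `score_band` is hypothetical, and the use of the sample standard deviation (rather than the population form) is an assumption, since the captions do not specify which is used.

```python
from statistics import mean, stdev


def score_band(runs):
    """Compute (mean, mean - std, mean + std) for each day across runs.

    `runs` is a list of equal-length lists, one list of daily scores per run.
    Uses the sample standard deviation, so at least two runs are required.
    """
    band = []
    for day_scores in zip(*runs):  # group the scores of all runs by day
        m = mean(day_scores)
        s = stdev(day_scores)
        band.append((m, m - s, m + s))
    return band
```

A wide band at a given day indicates that runs of the same model diverge strongly there, which is the variance the benchmark reports.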