LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth

Weihao Zeng, Yuzhen Huang, Junxian He

TL;DR

LOCA-bench addresses the challenge of context rot in long-horizon language tasks by offering a benchmark that scales environment context controllably while preserving task semantics. It combines a suite of mock-service environments, seed tasks, and a modular scaffold to evaluate how models and context-management strategies handle expanding context lengths, from 8K to 256K tokens. Key findings show that most models suffer performance drops as context grows, with frontier models maintaining higher accuracy and benefiting more from context engineering, especially programmatic tool calling and memory-aware strategies. By open-sourcing the toolkit and decoupling environment, models, and scaffolds, LOCA-bench provides a practical platform to guide future development of long-context agents and their training and inference pipelines.

Abstract

Large language models (LLMs) are increasingly capable of carrying out long-running, real-world tasks. However, as the amount of context grows, their reliability often deteriorates, a phenomenon known as "context rot". Existing long-context benchmarks primarily focus on single-step settings that evaluate a model's ability to retrieve information from a long snippet. In realistic scenarios, however, LLMs often need to act as agents that explore environments, follow instructions and plans, extract useful information, and predict correct actions under a dynamically growing context. To assess language agents in such settings, we introduce LOCA-bench (a benchmark for LOng-Context Agents). Given a task prompt, LOCA-bench leverages automated and scalable control of environment states to regulate the agent's context length. This design enables LOCA-bench to extend the context length potentially to infinity in a controlled way while keeping the underlying task semantics fixed. LOCA-bench evaluates language agents as a combination of models and scaffolds, including various context management strategies. While agent performance generally degrades as the environment states grow more complex, advanced context management techniques can substantially improve the overall success rate. We open-source LOCA-bench to provide a platform for evaluating models and scaffolds in long-context, agentic scenarios: https://github.com/hkust-nlp/LOCA-bench

Paper Structure

This paper contains 28 sections, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Overview of results. Left: Accuracy changes across models as the environment description length increases. Right: Accuracy gains from different context engineering strategies for Gemini-3-Flash and GPT-5.2-Medium at 128K environment description length.
  • Figure 2: Illustration of the task generation pipeline. The figure shows an example of constructing a task that involves reading final-exam information from Canvas and email. From left to right, it shows how benchmark users set environment configuration parameters, such as the number of courses and the proportion of Canvas announcements versus email notifications. A programmatic generator then uses predefined templates for courses, exams, announcements, and emails to instantiate matching environment states (such as specific Canvas course pages, announcements, and email messages) and inserts them into the server.
  • Figure 3: Impact of environment description length on (a) trajectory length, (b) number of tool calls, and (c) tool output length.
  • Figure 4: An example of insufficient exploration. The task is to identify all products that satisfy the criteria and save them to a CSV file in the workspace. However, the agent fetches only the first 100 products and finds no matches in that subset. It then stops without checking the remaining catalog, writes nothing to the CSV, and the output does not match the ground-truth CSV, causing the evaluation to fail. We highlight the failed goal, the failure-related tool call, and the mismatched final workspace in red.
  • Figure 5: An example of declining complex reasoning. The task requires the model to gather final exam details from both Canvas announcements and email notifications, then link each exam to its corresponding course in Canvas. However, the model ignores the exam information contained in emails and never consults the Canvas dashboard for course identifiers. As a result, it writes only the exams mentioned in Canvas announcements into the Excel file. Since the ground truth includes exam information from both announcements and emails, this omission causes the evaluation to fail. We highlight the failed goal, the failure-related tool call, and the mismatched final workspace in red.
  • ...and 3 more figures
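
The task generation pipeline described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the LOCA-bench implementation: the names `EnvConfig`, `generate_environment`, and `description_length`, the two templates, and the placeholder course codes, dates, and rooms are all assumptions made for illustration. The sketch only shows the general idea that a few configuration parameters plus fixed templates can grow the environment state, and with it the text an agent must read, while the task itself (collect every final exam from both sources) stays the same.

```python
import random
from dataclasses import dataclass


@dataclass
class EnvConfig:
    """Hypothetical knobs a benchmark user might set (names are illustrative)."""
    num_courses: int            # more courses -> longer environment description
    announcement_ratio: float   # share of exams posted on Canvas rather than emailed
    seed: int = 0               # fixed seed keeps generation reproducible


# Illustrative templates; the benchmark's real templates are not shown in the source.
ANNOUNCEMENT_TEMPLATE = "[Canvas] {course}: the final exam is on {date} in room {room}."
EMAIL_TEMPLATE = "Subject: {course} final exam\nThe exam takes place on {date} in room {room}."


def generate_environment(config: EnvConfig) -> dict:
    """Instantiate an environment state and its ground truth from templates."""
    rng = random.Random(config.seed)
    announcements, emails, ground_truth = [], [], []
    for i in range(config.num_courses):
        # Placeholder exam record; values are synthetic, not benchmark data.
        exam = {
            "course": f"COURSE-{i:03d}",
            "date": f"2025-12-{rng.randint(1, 28):02d}",
            "room": str(rng.randint(100, 499)),
        }
        ground_truth.append(exam)
        # Route each exam to exactly one source, as in the Canvas/email example.
        if rng.random() < config.announcement_ratio:
            announcements.append(ANNOUNCEMENT_TEMPLATE.format(**exam))
        else:
            emails.append(EMAIL_TEMPLATE.format(**exam))
    return {"announcements": announcements, "emails": emails, "ground_truth": ground_truth}


def description_length(env: dict) -> int:
    """Rough size proxy: total characters the agent would have to read."""
    return sum(len(text) for text in env["announcements"] + env["emails"])


# Same task, two context scales: only num_courses changes.
small = generate_environment(EnvConfig(num_courses=5, announcement_ratio=0.5))
large = generate_environment(EnvConfig(num_courses=500, announcement_ratio=0.5))
```

Because the ground truth is produced by the same generator that fills the environment, an evaluator in this sketch could compare the agent's output file against `ground_truth` at any scale without manual annotation.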