Environment Maps: Structured Environmental Representations for Long-Horizon Agents

Yenchia Feng, Chirag Sharma, Karime Maamari

Abstract

Although large language models (LLMs) have advanced rapidly, robust automation of complex software workflows remains an open problem. In long-horizon settings, agents frequently suffer from cascading errors and environmental stochasticity; a single misstep in a dynamic interface can lead to task failure, resulting in hallucinations or trial-and-error. This paper introduces Environment Maps: a persistent, agent-agnostic representation that mitigates these failures by consolidating heterogeneous evidence, such as screen recordings and execution traces, into a structured graph. The representation consists of four core components: (1) Contexts (abstracted locations), (2) Actions (parameterized affordances), (3) Workflows (observed trajectories), and (4) Tacit Knowledge (domain definitions and reusable procedures). We evaluate this framework on the WebArena benchmark across five domains. Agents equipped with environment maps achieve a 28.2% success rate, nearly doubling the performance of baselines limited to session-bound context (14.2%) and outperforming agents that have access to the raw trajectory data used to generate the environment maps (23.3%). By providing a structured interface between the model and the environment, Environment Maps establish a persistent foundation for long-horizon planning that is human-interpretable, editable, and incrementally refinable.

Paper Structure (43 sections, 5 figures, 1 table)


Figures (5)

  • Figure 1: Environment map framework. The system extracts structured spatial and semantic knowledge from multimodal data (e.g., traces and videos) to build an environment map. The map serves as a knowledge base for the agent during task execution, informing its action–observation loop. Solid black lines indicate pathways used in our experiments; gray lines indicate optional pathways, such as human-in-the-loop edits and map updates from agent traces, which were disabled to maintain a controlled experimental setting.
  • Figure 2: (a) Overall end-to-end task success on WebArena Verified. Bars correspond to the three experimental conditions (Baseline, Trajectory access, Environment map access). $n{=}812$ tasks. (b) Success rate by environment (single-site tasks). The $x$-axis labels match the benchmark environments introduced in the benchmark section: E-Commerce denotes the Magento Storefront tasks ($n{=}187$), CMS denotes the Magento Admin tasks ($n{=}182$), GitLab denotes GitLab CE ($n{=}180$), Map denotes OpenStreetMap ($n{=}109$), and Reddit denotes the Postmill forum ($n{=}106$). (c) Generalization vs. demonstration coverage. Success rates on tasks that do vs. do not have a human demonstration trace (trace-covered: $n{=}179$; non-trace: $n{=}633$).
  • Figure 3: (a) Aggregate tool usage by experimental condition. Stacked bars show total tool calls by type across all $n{=}812$ tasks. (b) Mean tool calls per task by environment and outcome. Grouped bars show average tool calls for each condition, split by task outcome and environment.
  • Figure 4: Generalized environment map structure. A synthetic map example demonstrating the various node types and edge relationships.
  • Figure 5: GitLab environment map (curated subset). A curated 8-context, 42-action subset of the full 96-context GitLab environment map derived from 41 WebArena task trajectories. This visualization shows representative contexts including the main dashboard, search page, project views, and user settings. The density of connections illustrates the navigational structure captured from real agent interactions, while the mix of taken (orange) and potential (green) actions demonstrates how parameterization expands the action space beyond observed instances.
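
The four components named in the abstract, together with the taken/potential action distinction in the Figure 5 caption, suggest a graph-shaped data structure. The following is a minimal sketch of one possible layout, not the authors' schema: every class, field, and function name here (`Action`, `EnvironmentMap`, `route`, `build_example_map`, and the toy context names) is a hypothetical choice made for illustration, and the breadth-first lookup is only one plausible way an agent might query such a map.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Action:
    """A parameterized affordance: an edge from one context to another."""
    name: str
    source: str                   # id of the context where the action is available
    target: str                   # id of the context reached after the action
    params: tuple = ()            # names of parameters the action accepts
    taken: bool = False           # True if observed in a trajectory, False if only potential


@dataclass
class EnvironmentMap:
    """Hypothetical container for the four components listed in the abstract."""
    contexts: dict = field(default_factory=dict)         # context id -> description
    actions: list = field(default_factory=list)          # Action edges
    workflows: dict = field(default_factory=dict)        # workflow name -> list of action names
    tacit_knowledge: dict = field(default_factory=dict)  # term -> definition or procedure

    def add_action(self, action: Action) -> None:
        # Reject edges whose endpoints are not registered contexts.
        for ctx in (action.source, action.target):
            if ctx not in self.contexts:
                raise KeyError(f"unknown context: {ctx}")
        self.actions.append(action)

    def affordances(self, context_id: str) -> list:
        """Return the actions available from a given context."""
        return [a for a in self.actions if a.source == context_id]

    def route(self, start: str, goal: str) -> Optional[list]:
        """Breadth-first search for a shortest sequence of action names."""
        if start == goal:
            return []
        parents = {start: None}   # context -> (previous context, action name)
        queue = deque([start])
        while queue:
            ctx = queue.popleft()
            for action in self.affordances(ctx):
                if action.target in parents:
                    continue
                parents[action.target] = (ctx, action.name)
                if action.target == goal:
                    path, node = [], goal
                    while parents[node] is not None:
                        node, name = parents[node]
                        path.append(name)
                    return path[::-1]
                queue.append(action.target)
        return None               # goal not reachable from start


def build_example_map() -> EnvironmentMap:
    """Toy three-context map; the context names are illustrative only."""
    m = EnvironmentMap()
    m.contexts.update({
        "dashboard": "Main dashboard",
        "search": "Search page",
        "project": "Project view",
    })
    m.add_action(Action("open_search", "dashboard", "search", taken=True))
    m.add_action(Action("open_project", "search", "project",
                        params=("project_name",), taken=False))
    m.workflows["find_project"] = ["open_search", "open_project"]
    m.tacit_knowledge["project"] = "A repository with its issues and settings."
    return m
```

Calling `build_example_map().route("dashboard", "project")` walks the two edges in order. Keeping `taken` as a flag on the edge, rather than in a separate collection, lets observed and merely potential actions share one lookup path, which is one way to read the caption's remark that parameterization expands the action space beyond observed instances.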