RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

Linghua Zhang; Jun Wang; Jingtong Wu; Zhisong Zhang

Abstract

Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. The separation is particularly important for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments on eight state-of-the-art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to the baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.

Paper Structure (84 sections, 11 equations, 7 figures, 8 tables)

Introduction
Environment Construction
Problem Formulation and Overview
State Space
Action Space
Day Transition Dynamics
Evolving Strategy & Execution Framework
Framework Detail
Hierarchical Policy Representation
Experiment Settings
Environment Configurations
Evaluation Metrics
Metrics.
Experimental Setup
Agent Frameworks.
...and 69 more sections

Figures (7)

Figure 1: Overview of the hierarchical supermarket environment, illustrating intra-day agent–environment interactions and end-of-day state transition dynamics.
Figure 2: Category-level sales and profit across the Easy, Middle, and Hard environments. Results are shown for three representative models.
Figure 3: Net worth and available funds trajectories of the heuristic policy under different environment configurations, illustrating the calibrated difficulty levels. The Middle and Hard settings involve a larger number of product categories, enabling higher potential net worth and cash accumulation over the course of an episode.
Figure 4: Macro strategy similarity over time in the Easy environment. Higher values indicate greater consistency in high-level decisions across days.
Figure 5: Execution strategy similarity over time in the Easy environment. Execution-level behaviors exhibit substantially higher temporal variability than macro strategies.
...and 2 more figures
