Knowing What Not to Do: Leverage Language Model Insights for Action Space Pruning in Multi-agent Reinforcement Learning

Zhihao Liu; Xianliang Yang; Zichuan Liu; Yifan Xia; Wei Jiang; Yuanyu Zhang; Lijuan Li; Guoliang Fan; Lei Song; Bian Jiang

TL;DR

This paper tackles the challenge of exponential action-space growth in multi-agent reinforcement learning by introducing eSpark, an LLM-driven framework that generates exploration functions to prune irrelevant actions. Through zero-shot prompts, an LLM checker, evolutionary search, and iterative reflection, eSpark maintains a narrowed, more informative policy space and guides MARL learning toward better coordination. Empirical results on 15 scenarios across MABIM and SUMO demonstrate consistent improvements over IPPO, including significant scalability gains with hundreds of agents and large SKU counts. The work highlights the practical potential of leveraging large language models for guiding exploration and pruning in complex, cooperative MARL settings, while noting limitations such as homogeneity of agents and reliance on informative feedback for reflection.

Abstract

Multi-agent reinforcement learning (MARL) is employed to develop autonomous agents that can learn to adopt cooperative or competitive strategies within complex environments. However, the linear increase in the number of agents leads to a combinatorial explosion of the action space, which may result in algorithmic instability, difficulty in convergence, or entrapment in local optima. While researchers have designed a variety of effective algorithms to compress the action space, these methods also introduce new challenges, such as the need for manually designed prior knowledge or reliance on the structure of the problem, which limits their applicability. In this paper, we introduce Evolutionary action SPAce Reduction with Knowledge (eSpark), an exploration function generation framework driven by large language models (LLMs) to boost exploration and prune unnecessary actions in MARL. Using just a basic prompt that outlines the overall task and setting, eSpark can generate exploration functions in a zero-shot manner, identify and prune redundant or irrelevant state-action pairs, and then improve autonomously from policy feedback. In reinforcement learning tasks involving inventory management and traffic light control, encompassing a total of 15 scenarios, eSpark consistently outperforms the MARL algorithm it is combined with in all scenarios, achieving average performance gains of 34.4% and 9.9% in the two types of tasks, respectively. Additionally, eSpark can handle settings with a large number of agents, securing a 29.7% improvement in scalability challenges featuring over 500 agents. The code is available at https://github.com/LiuZhihao2022/eSpark.git.
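To make the pruning idea concrete, the following is a minimal sketch of how an exploration function could restrict one agent's discrete action distribution. It is not taken from the eSpark code base: the names `masked_policy` and `example_rule`, the 0/1 mask convention, the toy observation layout `(stock, mean_demand)`, and the fallback when every action is masked are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def masked_policy(logits: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Softmax over logits, with pruned actions (mask == 0) given probability 0."""
    if not any(mask):
        # Assumption: if a rule prunes everything, fall back to no pruning.
        mask = [1] * len(logits)
    top = max(l for l, m in zip(logits, mask) if m)
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def example_rule(obs: Sequence[float]) -> List[int]:
    """Hypothetical exploration function for a toy inventory agent.

    obs = (stock, mean_demand); action i means "order i * mean_demand".
    """
    stock, mean_demand = obs
    if stock >= 2 * mean_demand:
        # Stock already covers two periods of demand: keep only "order nothing".
        return [1, 0, 0, 0]
    return [1, 1, 1, 1]


# With ample stock, every ordering action is pruned before sampling.
probs = masked_policy([0.5, 1.0, 2.0, 0.1], example_rule((30.0, 10.0)))
```

In this sketch the mask is applied before normalization, so pruned actions are never sampled and receive no probability mass during training.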

Paper Structure (44 sections, 4 theorems, 27 equations, 18 figures, 17 tables, 1 algorithm)


Introduction
Related works
Preliminaries
Problem formulation and notations
Markov game framework.
Policy with exploration function.
Challenges and motivations
Method
Exploration function generation
Evolutionary search
Reflection and feedback
Experiments
Experiment settings
Experiment results
Performance on MABIM
...and 29 more sections

Key Result

Proposition 1

Figures (18)

Figure 1: eSpark first generates $K$ exploration functions via zero-shot creation. Each exploration function is then used to guide an independent policy, and evolutionary search is performed to find the best-performing policy. Finally, eSpark reflects on the feedback from the best-performing policy, then refines and regenerates the exploration functions for the next iteration.
Figure 2: Action selection frequency for IPPO and various pruning methods on the 100 SKUs Lowest scenario. "Actions" denotes the replenishment quantity as a multiple of the mean demand within the sliding window. eSpark learns not only to minimize restocking but also to diversify with small purchases below the mean demand, balancing demand fulfillment and overflow prevention.
Figure 3: MABIM inventory model.
Figure 4: The performance comparison between eSpark and IPPO in 100 SKUs scenarios. In capacity-limited scenarios, eSpark strives to meet the demands while minimizing overflow costs, boasting a lower overflow ratio and a higher fulfillment ratio. In the multiple-echelon challenge, eSpark achieves nuanced collaboration across different echelons, ensuring high fulfillment ratios.
Figure 5: System prompt for $\texttt{LLM}_c.$
...and 13 more figures
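
The generate, evaluate, and reflect cycle described in the Figure 1 caption can be sketched as a short loop. This is a toy illustration under stated assumptions, not the authors' implementation: `espark_style_loop` and the `toy_*` helpers are hypothetical, candidates are plain integers standing in for exploration functions, the LLM and the MARL training run are replaced by deterministic stubs, the LLM checker is not modeled, and keeping the best candidate across iterations is an assumption.

```python
from typing import Callable, List, Optional, Tuple


def espark_style_loop(
    generate: Callable[[Optional[int], int], List[int]],
    evaluate: Callable[[int], float],
    reflect: Callable[[int, float], int],
    k: int,
    iterations: int,
) -> Tuple[float, int]:
    """Generate k candidates, score each, keep the best, and feed it back."""
    feedback: Optional[int] = None
    best: Optional[Tuple[float, int]] = None
    for _ in range(iterations):
        candidates = generate(feedback, k)               # stands in for LLM generation
        scored = [(evaluate(c), c) for c in candidates]  # stands in for K training runs
        top = max(scored, key=lambda pair: pair[0])      # selection step
        if best is None or top[0] > best[0]:
            best = top
        feedback = reflect(top[1], top[0])               # stands in for LLM reflection
    assert best is not None
    return best


def toy_generate(feedback: Optional[int], k: int) -> List[int]:
    """Propose k candidates starting from the last feedback (or 0)."""
    start = 0 if feedback is None else feedback
    return [start + step for step in range(k)]


def toy_evaluate(candidate: int) -> float:
    """Toy score that peaks at candidate == 7."""
    return -abs(candidate - 7)


def toy_reflect(candidate: int, score: float) -> int:
    """Toy reflection: pass the best candidate on as feedback."""
    return candidate


result = espark_style_loop(toy_generate, toy_evaluate, toy_reflect, k=3, iterations=5)
```

With these stubs the search moves toward the peak of the toy score over successive iterations; in the actual framework each evaluation corresponds to training a policy guided by one exploration function.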

Theorems & Definitions (6)

Proposition 1
Proposition 2
Proposition 1
proof
Proposition 2
proof
