Homomorphic Mappings for Value-Preserving State Aggregation in Markov Decision Processes

Shuo Zhao, Yongqiang Li, Yu Feng, Zhongsheng Hou, Yuanjing Feng

TL;DR

This paper proposes Homomorphic Policy Gradient (HPG), which guarantees optimal policy equivalence under sufficient conditions, and its extension, Error-Bounded HPG (EBHPG), which balances computational efficiency and the performance loss induced by aggregation.

Abstract

State aggregation aims to reduce the computational complexity of solving Markov Decision Processes (MDPs) while preserving the performance of the original system. A fundamental challenge lies in optimizing policies within the aggregated, or abstract, space such that the performance remains optimal in the ground MDP, a property referred to as "optimal policy equivalence". This paper presents an abstraction framework based on the notion of homomorphism, in which two Markov chains are deemed homomorphic if their value functions exhibit a linear relationship. Within this theoretical framework, we establish a sufficient condition for optimal policy equivalence. We further examine scenarios where the sufficient condition is not met and derive an upper bound on the approximation error and a performance lower bound for the objective function under the ground MDP. We propose Homomorphic Policy Gradient (HPG), which guarantees optimal policy equivalence under sufficient conditions, and its extension, Error-Bounded HPG (EBHPG), which balances computational efficiency and the performance loss induced by aggregation. In the experiments, we validate the theoretical results and conduct comparative evaluations against seven algorithms.

Paper Structure

This paper contains 15 sections, 8 theorems, 54 equations, 3 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

If matrix $A$ satisfies that $\lim_{t \rightarrow \infty }A^{t}= {0}$, then $(I - A)^{-1} = \sum_{t = 0}^{\infty} A^{t}$.
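
Lemma 1 can be checked numerically on a small case. The sketch below is an illustration only and is not taken from the paper: it assumes $A = \gamma P$ for a hypothetical discount factor `gamma` and a hypothetical 2-state transition matrix `P`, so that $A^{t} \rightarrow 0$, and compares a truncated series $\sum_{t} A^{t}$ with the closed-form inverse $(I - A)^{-1}$. All helper names are made up for this example.

```python
# Numerical check of Lemma 1 on a 2x2 example (pure Python, no dependencies).
# gamma and P are hypothetical values chosen for illustration, not from the paper.

def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def neumann_sum(A, terms):
    """Truncated series I + A + A^2 + ... + A^(terms-1)."""
    n = len(A)
    total = [[0.0] * n for _ in range(n)]
    power = identity(n)  # A^0
    for _ in range(terms):
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
        power = matmul(power, A)
    return total

def inverse_2x2(M):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def max_abs_diff(X, Y):
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

gamma = 0.9                      # assumed discount factor
P = [[0.5, 0.5], [0.2, 0.8]]     # assumed row-stochastic transition matrix
A = [[gamma * p for p in row] for row in P]

I2 = identity(2)
I_minus_A = [[I2[i][j] - A[i][j] for j in range(2)] for i in range(2)]

exact = inverse_2x2(I_minus_A)
series = neumann_sum(A, terms=500)

print("max |series - inverse| =", max_abs_diff(series, exact))
```

Because `P` is row-stochastic, each row of $(I - \gamma P)^{-1}$ sums to $1/(1-\gamma)$, which gives a second quick sanity check on the computed inverse.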

Figures (3)

  • Figure 1: This figure illustrates the relationship between the ground Markov chain, encoding Markov chains, and the homomorphic Markov chain. In general, encoding Markov chains may exhibit discrepancies relative to the ground Markov chain. However, there always exists a homomorphic Markov chain corresponding to any encoding Markov chain. Therefore, encoding Markov chains can serve as a critical bridge connecting the ground MDP and homomorphic mappings.
  • Figure 2: In the experimental results, the $x$-axis represents the number of iterations, while the $y$-axis indicates policy performance. The task name appears at the top of each subplot; the "Random Model" task comprises three scenarios with different density levels (10%, 50%, and 100%). The curves labeled "20%", "50%", "80%", and "100%" correspond to different settings of the abstract state space size, namely $|U| = \mathrm{int}(0.2\,r)$, $|U| = \mathrm{int}(0.5\,r)$, $|U| = \mathrm{int}(0.8\,r)$, and $|U| = \mathrm{int}(r)$, respectively. Subfigures (a)-(f) show the results of Algorithm \ref{alg_HM} under different values of $|U|$, while subfigures (g)-(l) present the results of Algorithm \ref{alg_HM2} under the same settings. In all subfigures, the purple dashed line represents the policy performance after 40,000 iterations of the policy iteration algorithm, which serves as an approximation of the optimal policy performance. In subfigures (g)-(l), solid lines indicate the actual policy performance (left $y$-axis), while dashed lines represent the performance lower bound, corresponding to the $J_{U}(f_{\nu}(\tilde{\pi})) - \frac{\| g(\tilde{\pi}, \nu) \|}{1 - \gamma}$ term in Equation \ref{polbt_1}. In subfigures (k) and (l), the dashed lines are plotted against the right $y$-axis.
  • Figure 3: In the experimental results, the $x$-axis represents wall-clock time (execution time on a physical computing system), while the $y$-axis indicates policy performance. The results corresponding to Algorithm \ref{alg_HM2} are labeled "Our". In each plot, the solid line represents the average over five runs, while the shaded region indicates the range between the maximum and minimum values.

Theorems & Definitions (19)

  • Definition 1: Homomorphic Markov Chain
  • Definition 2: Optimal Policy Equivalence
  • Lemma 1: Matrix Geometric Series (bao2008infinite, p. 328)
  • Theorem 1
  • proof 1
  • Definition 3: Homomorphic Mapping
  • Theorem 2
  • proof 2
  • Definition 4
  • Theorem 3
  • ...and 9 more