Predictable Interval MDPs through Entropy Regularization

Menno van Zutphen, Giannis Delimpaltadakis, Maurice Heemels, Duarte Antunes

TL;DR

The paper addresses the robust minimization of a linear combination of cumulative cost and entropy in interval Markov decision processes (IMDPs), with the aim of achieving predictable yet robust control under transition uncertainty. It develops a dynamic-programming-based value-iteration scheme that computes an optimal deterministic policy and the corresponding upper bound on the cost-entropy objective by solving convex programs at each time step. The authors prove that deterministic optimal policies exist and provide a constructive algorithm (Algorithm 1) with complexity linear in the horizon and policy dimensions. An agricultural-field example demonstrates that entropy regularization can yield highly predictable behavior with only a modest increase in the objective, illustrating practical benefits for real-world autonomous systems.

Abstract

Regularization of control policies using entropy can be instrumental in adjusting predictability of real-world systems. Applications benefiting from such approaches range from, e.g., cybersecurity, which aims at maximal unpredictability, to human-robot interaction, where predictable behavior is highly desirable. In this paper, we consider entropy regularization for interval Markov decision processes (IMDPs). IMDPs are uncertain MDPs, where transition probabilities are only known to belong to intervals. Lately, IMDPs have gained significant popularity in the context of abstracting stochastic systems for control design. In this work, we address robust minimization of the linear combination of entropy and a standard cumulative cost in IMDPs, thereby establishing a trade-off between optimality and predictability. We show that optimal deterministic policies exist, and devise a value-iteration algorithm to compute them. The algorithm solves a number of convex programs at each step. Finally, through an illustrative example we show the benefits of penalizing entropy in IMDPs.
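
To make the interval uncertainty concrete, the following Python sketch computes the worst-case expected next-step value over a single interval-constrained transition row. It covers only the cost-only case (no entropy term), where the adversary's problem is linear and a standard greedy ordering argument solves it; with the entropy term included, the paper instead solves convex programs, which this sketch does not attempt. The function name `worst_case_expectation` and all numbers are hypothetical illustrations, not taken from the paper.

```python
def worst_case_expectation(lower, upper, values):
    """Maximize sum_i p[i] * values[i] subject to
    lower[i] <= p[i] <= upper[i] and sum(p) == 1 (cost-only case, assumed)."""
    if sum(lower) > 1.0 + 1e-12 or sum(upper) < 1.0 - 1e-12:
        raise ValueError("intervals admit no probability distribution")
    p = list(lower)                 # start from the lower bounds
    remaining = 1.0 - sum(lower)    # probability mass still to distribute
    # Give the remaining mass to the highest-valued successors first.
    order = sorted(range(len(values)), key=lambda j: values[j], reverse=True)
    for i in order:
        extra = min(upper[i] - lower[i], remaining)
        p[i] += extra
        remaining -= extra
    return p, sum(pi * vi for pi, vi in zip(p, values))


# Hypothetical three-successor example.
lower = [0.1, 0.2, 0.3]
upper = [0.5, 0.6, 0.7]
values = [4.0, 1.0, 2.0]
p_worst, worst = worst_case_expectation(lower, upper, values)
```

In a robust value iteration of this kind, such an inner maximization would be evaluated once per state-action pair at each time step, and the policy would then pick the action with the smallest worst-case value.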

Paper Structure

This paper contains 12 sections, 5 theorems, 28 equations, 3 figures, and 1 algorithm.

Key Result

Lemma 1

(Recursive Expected Cost Computation) The expected cumulative cost associated with $\mathcal{I}^{\pi,\xi}=(S,\alpha,c,c_h,P^{\pi,\xi},h)$ is given in terms of a function $U_0^{\pi,\xi}$ that is defined by a backward recursion over $k\in\{h-1,h-2,\dots,0\}$ with initialization $U_h^{\pi,\xi}(s)=c_h(s)$ for $s\in S$.
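
A backward recursion of this kind can be sketched in Python as follows. The lemma's displayed equations are not reproduced in this summary, so the sketch rests on assumptions that are not confirmed by the source: the stage cost `c` depends only on the current state, the induced transition matrix `P` is the same at every time step, and the expected cost is the `alpha`-weighted average of $U_0$. The function name `expected_cumulative_cost` and the two-state numbers are hypothetical.

```python
def expected_cumulative_cost(alpha, P, c, c_h, h):
    """Backward recursion for a finite-horizon Markov chain (assumed form):
    U_h(s) = c_h(s),  U_k(s) = c(s) + sum_q P[s][q] * U_{k+1}(q)."""
    n = len(alpha)
    U = list(c_h)                   # initialization U_h(s) = c_h(s)
    for _ in range(h):              # k = h-1, h-2, ..., 0
        U = [c[s] + sum(P[s][q] * U[q] for q in range(n)) for s in range(n)]
    # Assumed: the expected cost averages U_0 over the initial distribution.
    return sum(alpha[s] * U[s] for s in range(n)), U


# Hypothetical two-state chain; state 1 is absorbing.
alpha = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]
c = [1.0, 0.0]      # stage cost per state
c_h = [0.0, 2.0]    # terminal cost per state
J, U0 = expected_cumulative_cost(alpha, P, c, c_h, h=2)
```

Each pass costs one matrix-vector product, so the total work grows linearly with the horizon `h`.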

Figures (3)

  • Figure 1: Left: inspection robot A can progress deterministically in a clockwise fashion while exterminator B is not in its way. Right: when B is present in the quadrant ahead of robot A, A makes a highly unpredictable evasive maneuver.
  • Figure 2: The upper bound on the linear combination of cumulative cost and entropy under a) the optimal policy and optimal adversary, b) the optimal policy and a random adversary, c) a random policy and random adversary.
  • Figure 3: The locations of robots A and B over time in ten simulated trajectories subject to an optimal policy with no entropy regularization (top figure, $\beta=0$), and ten simulated trajectories subject to an optimal policy with entropy regularization (bottom figure, $\beta=1$). We see that regularization of the policy using entropy has the clear effect of improving the predictability of the system.

Theorems & Definitions (9)

  • Definition 1: IMDP
  • Definition 2: Policy
  • Definition 3: Adversary
  • Lemma 1: Recursive Expected Cost Computation
  • Lemma 2: Recursive Entropy Computation
  • Theorem 1: Cost-Entropy Trade-Off Minimization
  • Theorem 2: Deterministic Policies Minimize $\overline{J}^{*}(\mathcal{I})$
  • Remark 1
  • Lemma 3: Concavity of $\Phi(p,V)$
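
The statement of Lemma 2 is not reproduced in this summary. As a hedged illustration of what a recursive entropy computation can look like, the Python sketch below uses the chain rule for Shannon entropy: the entropy of a Markov chain's trajectory is the entropy of the initial state plus the expected sum of the per-step transition entropies. The time-invariant transition matrix, the natural logarithm, the inclusion of the initial-distribution term, and the names `row_entropy` and `trajectory_entropy` are all assumptions rather than the paper's definitions.

```python
import math


def row_entropy(row):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p) for p in row if p > 0.0)


def trajectory_entropy(alpha, P, h):
    """Entropy of (X_0, ..., X_h) for a Markov chain via backward recursion:
    W_h(s) = 0,  W_k(s) = H(P[s]) + sum_q P[s][q] * W_{k+1}(q)."""
    n = len(alpha)
    local = [row_entropy(P[s]) for s in range(n)]   # per-state step entropy
    W = [0.0] * n
    for _ in range(h):
        W = [local[s] + sum(P[s][q] * W[q] for q in range(n)) for s in range(n)]
    # Assumed: the entropy of the initial distribution is counted as well.
    return row_entropy(alpha) + sum(alpha[s] * W[s] for s in range(n))


# Hypothetical two-state chain; state 1 is absorbing.
alpha = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]
H = trajectory_entropy(alpha, P, h=2)
```

A chain with deterministic transitions and a deterministic initial state yields zero trajectory entropy under this recursion, which matches the intuition that penalizing entropy favors predictable behavior.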