Multiple Mean-Payoff Optimization under Local Stability Constraints

David Klaška, Antonín Kučera, Vojtěch Kůr, Vít Musil, Vojtěch Řehák

TL;DR

The work tackles local stability in mean-payoff optimization by introducing window mean payoffs to bound short-horizon performance while jointly optimizing multiple objectives. It proposes WinMPsynt, a differentiable-programming–based algorithm that uses dynamic programming to compute expectations and gradients with respect to an Eval objective, enabling gradient-driven strategy improvement for finite-memory randomized strategies in Markov decision processes. The authors prove that the problem is NP-hard in general, but demonstrate practical scalability and high-quality strategies on nontrivial instances, thanks to the decomposability of Eval and efficient dynamic-programming computations. Experiments on structured graphs show substantial speedups over naive baselines and show that both memory and randomization help in achieving near-optimal window payoffs, which makes the method relevant to the design of dependable autonomous systems.

Abstract

The long-run average payoff per transition (mean payoff) is the main tool for specifying the performance and dependability properties of discrete systems. The problem of constructing a controller (strategy) simultaneously optimizing several mean payoffs has been deeply studied for stochastic and game-theoretic models. One common issue of the constructed controllers is the instability of the mean payoffs, measured by the deviations of the average rewards per transition computed in a finite "window" sliding along a run. Unfortunately, the problem of simultaneously optimizing the mean payoffs under local stability constraints is computationally hard, and the existing works do not provide a practically usable algorithm even for non-stochastic models such as two-player games. In this paper, we design and evaluate the first efficient and scalable solution to this problem applicable to Markov decision processes.
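
To make the notion of instability concrete, the following minimal sketch computes the average reward per transition in a window of length `d` sliding along a finite run prefix, and reports the largest deviation from a target value. The function names, the example runs, and the use of absolute deviation from a fixed target are illustrative assumptions; the paper's own window mean payoff and Eval function are not reproduced in this summary and may be defined differently.

```python
from collections import deque


def window_mean_payoffs(rewards, d):
    """Average reward over every window of d consecutive transitions.

    `rewards` is a finite prefix of a run, one reward per transition.
    (Hypothetical helper for illustration, not taken from the paper.)
    """
    if d <= 0:
        raise ValueError("window length d must be positive")
    window = deque()
    total = 0.0
    means = []
    for r in rewards:
        window.append(r)
        total += r
        if len(window) > d:
            # Slide the window: drop the oldest reward.
            total -= window.popleft()
        if len(window) == d:
            means.append(total / d)
    return means


def max_deviation(rewards, d, target):
    """Largest absolute deviation of a window mean from `target`.

    Using the absolute deviation is an assumption made for this sketch.
    """
    return max(abs(m - target) for m in window_mean_payoffs(rewards, d))


# Two runs with the same overall mean payoff (0.5) but different stability.
run_a = [1, 0, 1, 0, 1, 0, 1, 0]  # rewards alternate
run_b = [1, 1, 1, 1, 0, 0, 0, 0]  # rewards arrive in one burst

print(max_deviation(run_a, d=2, target=0.5))
print(max_deviation(run_b, d=2, target=0.5))
```

Both runs have the same average reward over the whole prefix, yet every window of length 2 in `run_a` averages exactly 0.5, while the windows of `run_b` range from 0 to 1. This is the kind of difference that a plain mean-payoff objective cannot see and that a local stability constraint is meant to capture.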

Paper Structure

This paper contains 31 sections, 3 theorems, 15 equations, 5 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Let $D = (V,E,p)$ be a graph, $\textit{Pay}_1,\ldots,\textit{Pay}_k$ payoff functions, $d \leq |V|$ a time horizon, and $\textit{Eval}$ an evaluation function. The problem of whether there exists an FR strategy $\sigma$ such that $\textit{wval}^\sigma \leq 0$ is $\mathsf{NP}$-hard. Furthermore, the p

Figures (5)

  • Figure 1: For a memory with $K$ states where $K \in \{1,2\}$, the best deterministic strategy (left) is worse than a suitable randomized strategy (right). The optimal finite memory strategy requires $7$ memory elements.
  • Figure 2: The graph $D_6$ and the payoffs assigned to vertices.
  • Figure 3: More memory states lead to better strategies.
  • Figure 4: Dynamic evaluation procedure outperforms the DFS-based one.
  • Figure 5: The graph $D_6$.

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Theorem 3