Optimal Control of Logically Constrained Partially Observable and Multi-Agent Markov Decision Processes

Krishna C. Kalagarla; Dhruva Kartik; Dongming Shen; Rahul Jain; Ashutosh Nayyar; Pierluigi Nuzzo

TL;DR

The paper tackles the challenge of designing policies for partially observable systems that must satisfy complex temporal logic constraints while maximizing cumulative reward. It introduces a constrained product POMDP formalism by augmenting the environment with a DFA that encodes the $\textsc{LTL}_f$ specification, and solves the resulting problem via a no-regret primal-dual approach using Exponentiated Gradient, with theoretical bounds on near-optimality and feasibility. The framework is extended to multi-agent settings under information asymmetry by leveraging the common information approach, enabling tractable solutions and preserving performance guarantees. Experiments across single- and multi-agent gridworld scenarios demonstrate effective trade-offs between exploration, reward, and specification satisfaction, validating the method’s practical relevance for safety-critical autonomous systems.
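The primal-dual idea in the summary above can be sketched on a toy instance. This is only an illustration under stated assumptions, not the paper's algorithm: the POMDP solver is replaced by a hypothetical finite list of pure policies with known (reward, satisfaction probability) pairs, and the names `primal_dual_eg`, `bound`, `eta`, and `threshold` are invented for this example. The multiplier is updated with an Exponentiated Gradient step on the two-component vector (lambda, bound - lambda), and the returned mixed policy is the empirical average of the best responses.

```python
import math

def primal_dual_eg(policies, threshold, bound=10.0, eta=0.5, iters=2000):
    """Toy primal-dual loop with an Exponentiated Gradient dual update.

    policies  : list of (reward, sat_prob) pairs, one per hypothetical pure policy
    threshold : required probability of satisfying the specification
    bound     : assumed upper bound on the Lagrange multiplier
    """
    lam = bound / 2.0                      # assumed initial multiplier
    counts = [0] * len(policies)
    for _ in range(iters):
        # Primal step: best response to the current multiplier
        # (stands in for solving an unconstrained POMDP).
        best = max(range(len(policies)),
                   key=lambda i: policies[i][0] + lam * policies[i][1])
        counts[best] += 1
        # Dual step: EG update on (lam, bound - lam). Only the first
        # component has a nonzero gradient, namely the constraint slack.
        slack = policies[best][1] - threshold
        w = lam * math.exp(-eta * slack)
        lam = bound * w / (w + (bound - lam))
    # Mixed policy: empirical frequency of each best response.
    mix = [c / iters for c in counts]
    avg_reward = sum(m * p[0] for m, p in zip(mix, policies))
    avg_sat = sum(m * p[1] for m, p in zip(mix, policies))
    return mix, avg_reward, avg_sat

if __name__ == "__main__":
    # Two hypothetical policies: high reward / unsafe, low reward / safe.
    print(primal_dual_eg([(1.0, 0.2), (0.4, 0.95)], threshold=0.8))
```

In this toy instance no single pure policy is both feasible and reward-optimal, so the averaged (mixed) policy is what approaches the constrained optimum.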

Abstract

Autonomous systems often have logical constraints arising, for example, from safety, operational, or regulatory requirements. Such constraints can be expressed using temporal logic specifications. The system state is often partially observable. Moreover, it could encompass a team of multiple agents with a common objective but disparate information structures and constraints. In this paper, we first introduce an optimal control theory for partially observable Markov decision processes (POMDPs) with finite linear temporal logic constraints. We provide a structured methodology for synthesizing policies that maximize a cumulative reward while ensuring that the probability of satisfying a temporal logic constraint is sufficiently high. Our approach comes with guarantees on approximate reward optimality and constraint satisfaction. We then build on this approach to design an optimal control framework for logically constrained multi-agent settings with information asymmetry. We illustrate the effectiveness of our approach by implementing it on several case studies.
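Read as an optimization problem, the goal stated in the abstract can be written roughly as follows. The symbols $\mathcal{R}^{\mathscr{M}}(\mu)$ (cumulative reward of policy $\mu$) and ${\mathbb P}_{\varphi}^{\mathscr{M}}(\mu)$ (probability of satisfying the specification $\varphi$) follow the notation used in the figure captions; the threshold parameter $\delta$ is an assumption introduced here for illustration.

$$\max_{\mu}\ \mathcal{R}^{\mathscr{M}}(\mu) \quad \text{subject to} \quad {\mathbb P}_{\varphi}^{\mathscr{M}}(\mu) \ge 1 - \delta$$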

Paper Structure (29 sections, 6 theorems, 51 equations, 5 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Labeled POMDPs
Model
Pure and Mixed Policies
Finite Linear Temporal Logic
Deterministic Finite Automaton (DFA)
Problem Formulation and Solution Strategy
Constrained Product POMDP
Constrained POMDP Formulation
A No-regret Learning Approach for Solving the Constrained POMDP
Multi-Agent Systems
Constrained Product Multi-Agent Problem
Global and Local Specifications
...and 14 more sections

Key Result

Theorem 1

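For any policy $\mu$, the reward and the constraint value agree across the two problems; in particular, $\mathcal{R}^{\mathscr{M}}(\mu) = \mathcal{R}^{\mathscr{M}^\times}(\mu)$, and the satisfaction probability in $\mathscr{M}$ equals the corresponding constraint value in $\mathscr{M}^\times$. Therefore, a policy $\mu^*$ is an optimal solution to Problem \ref{probform} (posed on $\mathscr{M}$) if and only if it is an optimal solution to Problem \ref{prodprobform} (posed on the product POMDP $\mathscr{M}^\times$), and $\mathcal{R}_{*}^{\mathscr{M}} = \mathcal{R}_{*}^{\mathscr{M}^\times} := \mathcal{R}^*$.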
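The intuition behind this equivalence can be illustrated with a small sketch of the product construction: the environment state is paired with a DFA state that tracks progress on the specification, so satisfaction becomes a terminal reward on the product state. Everything concrete here is an assumption made for illustration: the three-state reach-avoid DFA (roughly "stay safe until the goal is reached"), the label names, the convention that the DFA reads the label of every visited state, and the helper names `dfa_step` and `run_product`.

```python
from typing import Dict, FrozenSet, List, Tuple

ACCEPTING = {"q_acc"}  # hypothetical accepting DFA states

def dfa_step(q: str, label: FrozenSet[str]) -> str:
    """Transition of a hypothetical reach-avoid DFA."""
    if q == "q0":
        if "unsafe" in label:
            return "q_rej"      # unsafe region visited before the goal
        if "goal" in label:
            return "q_acc"      # goal reached while still safe
        return "q0"
    return q                    # q_acc and q_rej are absorbing

def run_product(trajectory: List[int],
                labeling: Dict[int, FrozenSet[str]]) -> Tuple[Tuple[int, str], float]:
    """Run the DFA alongside an environment trajectory.

    Returns the final product state (s, q) and a terminal reward that is
    1.0 exactly when the trajectory satisfies the specification.
    """
    q = "q0"
    for s in trajectory:
        q = dfa_step(q, labeling[s])   # assumed convention: read label of each visited state
    final_state = (trajectory[-1], q)
    return final_state, (1.0 if q in ACCEPTING else 0.0)

if __name__ == "__main__":
    labeling = {0: frozenset(), 1: frozenset({"unsafe"}), 2: frozenset({"goal"})}
    print(run_product([0, 0, 2], labeling))  # safe path to the goal
    print(run_product([0, 1, 2], labeling))  # passes through the unsafe state
```

Because the terminal reward depends only on the final DFA component, its expectation under a policy is the probability that the specification is satisfied, which is what lets the logical constraint be treated as a reward constraint.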

Figures (5)

Figure 1: Model $\mathscr{M}_3$ Reach-Avoid Task.
Figure 2: This plot illustrates how the Lagrange multiplier $\lambda_k$, the reward $\mathcal{R}^\mathscr{M}(\mu_k)$, and the probability of satisfaction ${\mathbb P}_{\varphi}^{\mathscr{M}}(\mu_k)$ evolve with $k$ for the experiment in Section \ref{exp:3}.
Figure 3: Model $\mathscr{M}_4$ Random Ordered Task with Goal.
Figure 4: Multi-agent system collision avoidance benchmark with random ordered tasks, one-way lane, and model $\mathscr{M}_5$.

Theorems & Definitions (14)

Remark 1
Theorem 1: Equivalence of Problems \ref{probform} and \ref{prodprobform}
Lemma 1
Remark 2
Theorem 2
Theorem 3
Remark 3
Remark 4
Remark 5
Lemma 2
...and 4 more
