Policy Gradient Methods for Designing Dynamic Output Feedback Controllers

Tomonori Sadamoto; Takumi Hirai

TL;DR

The paper addresses the design of dynamic output feedback controllers for discrete-time partially observable systems using policy gradient methods (PGMs). It introduces an $L$-length input-output history (IOH) framework that recasts dynamic output feedback as a state-feedback problem on an IOH-embedded system. This enables a model-based PGM with global linear convergence, established by proving the Polyak–Łojasiewicz inequality for a lossless projection of the IOH dynamics. The paper also develops model-free, zeroth-order PGM variants based on Monte Carlo gradient estimates and provides a sample-complexity analysis for them. Numerical simulations are reported to show robustness to noise and scalability to larger networks.
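To make the phrase "zeroth-order PGM with Monte Carlo gradient estimates" concrete, the following is a minimal sketch of a generic two-point smoothing estimator followed by plain gradient descent. It is not taken from the paper: the estimator form, the function names (`zeroth_order_gradient`, `toy_cost`, `run_descent`), the smoothing radius, and the quadratic stand-in for $J(K)$ are all assumptions chosen for illustration, and the paper's multi- and single-episodic algorithms may differ in detail.

```python
import numpy as np

# Hypothetical minimizer of the toy cost below (illustration only).
K_OPT = np.array([[1.0, -2.0, 0.5, 0.0]])


def toy_cost(K):
    """Hypothetical stand-in for J(K): a quadratic minimized at K_OPT."""
    return float(np.sum((K - K_OPT) ** 2))


def zeroth_order_gradient(cost, K, radius=0.05, num_samples=200, rng=None):
    """Two-point Monte Carlo estimate of the gradient of `cost` at K.

    Only cost evaluations are used (no model, no analytic gradient).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = K.size
    grad = np.zeros(K.shape, dtype=float)
    for _ in range(num_samples):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)  # random direction on the unit Frobenius sphere
        delta = cost(K + radius * U) - cost(K - radius * U)
        grad += (d / (2.0 * radius)) * delta * U
    return grad / num_samples


def run_descent(steps=50, step_size=0.1, seed=0):
    """Gradient descent on toy_cost driven by the estimated gradient."""
    rng = np.random.default_rng(seed)
    K = np.zeros(K_OPT.shape)
    for _ in range(steps):
        g = zeroth_order_gradient(toy_cost, K, num_samples=100, rng=rng)
        K = K - step_size * g
    return K
```

In a model-free setting, `cost` would be replaced by a cost measured from closed-loop experiments, so the number of samples per step directly determines how many experiments are needed, which is the quantity a sample-complexity analysis bounds.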

Abstract

This paper proposes model-based and model-free policy gradient methods (PGMs) for designing dynamic output feedback controllers for discrete-time partially observable systems. To this end, we first show that any dynamic output feedback controller design is equivalent to a state-feedback controller design for a newly introduced system whose internal state is a finite-length input-output history (IOH). Next, based on this equivalence, we propose a model-based PGM and show its global linear convergence by proving that the Polyak–Łojasiewicz inequality holds for a reachability-based lossless projection of the IOH dynamics. Moreover, we propose two model-free implementations of the PGM: the multi-episodic and single-episodic PGMs. The former is a Monte Carlo approximation of the model-based PGM, whereas the latter is a simplified version of the former for ease of use in real systems. A sample complexity analysis of both methods is also presented. Finally, the effectiveness of the model-based and model-free PGMs is investigated through a numerical simulation.
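The idea of treating a finite-length IOH as the "state" can be sketched as follows: a static gain $K$ acting on the stacked last $L$ inputs and outputs behaves as a dynamic output feedback controller for the original plant. The code below is an illustrative sketch only. The stacking order of the IOH, the zero input during the first $L$ steps, the identity cost weights, and the names `ioh_vector` and `rollout_cost` are assumptions, not the paper's definitions.

```python
import numpy as np


def ioh_vector(u_hist, y_hist, L):
    """Stack the last L inputs, then the last L outputs, into one vector.

    Assumed ordering: [u(t-L), ..., u(t-1), y(t-L), ..., y(t-1)].
    """
    return np.concatenate(
        [np.concatenate(u_hist[-L:]), np.concatenate(y_hist[-L:])]
    )


def rollout_cost(K, A, B, C, x0, L, horizon):
    """Finite-horizon cost sum_t (|y(t)|^2 + |u(t)|^2) under u(t) = K v(t).

    The plant state x is used only to simulate the system; the controller
    sees nothing but past inputs and outputs. For t < L the history is
    incomplete, so the input is set to zero (an assumption of this sketch).
    """
    m = B.shape[1]
    x = np.array(x0, dtype=float)
    u_hist, y_hist, cost = [], [], 0.0
    for t in range(horizon):
        y = C @ x
        if t >= L:
            u = K @ ioh_vector(u_hist, y_hist, L)
        else:
            u = np.zeros(m)
        cost += float(y @ y + u @ u)
        u_hist.append(u)
        y_hist.append(y)
        x = A @ x + B @ u
    return cost


# Hypothetical two-state, single-input, single-output plant.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
L = 2
K_zero = np.zeros((1, 2 * L))  # gain on an IOH of dimension L*(m + r) = 4
J_open_loop = rollout_cost(K_zero, A, B, C, [1.0, 1.0], L, horizon=50)
```

Because `rollout_cost` depends on the controller only through the matrix `K`, it can be passed to any gradient-based or gradient-free optimizer over `K`, which is the practical benefit of the state-feedback reformulation.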

Paper Structure (23 sections, 14 theorems, 121 equations, 10 figures, 1 algorithm)

Introduction
Problem Setup
Preliminary for Formulation
Formulation and Overview of Approach
Model-Based Policy Gradient Method and its Convergence Analysis
Model-Based Policy Gradient Method
Convergence Analysis
Model-Free Policy Gradient Method and Sample Complexity Analysis
Model-Free PGM
Sample Complexity Analysis
Numerical Simulation
Conclusion
Proof of Lemma \ref{lem_VARX}
Proof of Lemma \ref{lem2}
Proof of Proposition \ref{prop0_1}
...and 8 more sections

Key Result

Lemma 1

Consider ${\bm \Sigma}_{\rm s}$ in \ref{1} and $v$ in \ref{def_IOH}. If holds, then for any $u$ and $x(0)$, the IOH $v$ and output $y$ obey where $\Gamma \coloneqq \left[{\mathcal{R}}_L({\bm \Sigma}_{\rm s}) - A^L{\mathcal{O}}_L^{\dagger}({\bm \Sigma}_{\rm s}){\mathcal{H}}_L({\bm \Sigma}_{\rm s}), ~A^L{\mathcal{O}}_L^{\dagger}({\bm \Sigma}_{\rm s})\right]$, and $\Pi \coloneqq [0_{m \times (L-1)

Figures (10)

Figure 1: (Blue solid line) Variation of $J(K_i)$ in \ref{defJ} for the iteration $i$ of \ref{gd} when $L=2$. (Black dotted line) $J(K^{\star})$, where $K^{\star}$ is given by \ref{optK}.
Figure 2: Bode diagrams of ${\bm K}^{\star}_{\rm s}$ and ${\bm K}_{{\rm s},i}$ for $i$ indicated by the circles in Figure \ref{fig_J_MB}, where ${\bm K}^{\star}_{\rm s}$ and ${\bm K}_{{\rm s},i}$ are defined as in \ref{dyn_K} with \ref{ABCD_hat} for $K^{\star}$ and the corresponding $K_i$, respectively.
Figure 3: The blue solid line and the red and green dotted lines show the trajectories of $y \in \mathbb R^2$ in \ref{1} when $u = K_{\rm SF}^{\star}x$, ${\bm K}_{{\rm s},50\times 10^5}^{(2)}$, and ${\bm K}_{{\rm s},50\times 10^5}^{(4)}$ are actuated at $t=0$, $t=2$, and $t=4$, respectively.
Figure 4: Variation of $J(K_i)$ in \ref{defJ} when $L=4$.
Figure 5: (Colored area) 50 variations of $J(\tilde{K}_i)$ in \ref{defJ} for $\{s, N\} = \{1,50\}, \{1,500\}, \{10,500\}$ when $L=2$, where $\tilde{K}_i$ is generated by Algorithm 1. (Colored broken lines) The average of the corresponding 50 variations.
...and 5 more figures

Theorems & Definitions (34)

Definition 1
Lemma 1
proof
Lemma 2
proof
Remark 1
Proposition 1
proof
Remark 2
Remark 3
...and 24 more
