Policy Learning for Perturbance-wise Linear Quadratic Control Problem
Haoran Zhang, Wenhao Zhang, Xianping Wu
TL;DR
The paper addresses finite-horizon LQ control with additive noise under a perturbance-wise framework that unifies the classical model, constraint-embedded affine policies, and a Wasserstein DRO approach. It develops an augmented-affine policy representation and derives an exact policy gradient, proving global convergence under constant stepsizes with polynomial bounds in the problem parameters. The work integrates a ROC-like Riccati recursion for both the constrained and the distributionally robust variants, and validates the methods on mean-variance portfolio optimization and a real-data dynamic-tracking task, revealing trade-offs across horizon length, trading costs, state penalties, and estimation windows. The result is a practical, theoretically grounded toolkit for learning robust, constrained LQ controllers from finite data, with potential extensions to richer ambiguity sets and partial-observation settings.
Abstract
We study finite-horizon linear quadratic control with additive noise in a perturbance-wise framework that unifies the classical model, a constraint-embedded affine policy class, and a distributionally robust formulation with a Wasserstein ambiguity set. Based on an augmented affine representation, we model feasibility as an affine perturbation and unknown noise as a distributional perturbation estimated from samples, thereby addressing constrained implementation and model uncertainty in a single scheme. First, we construct an implementable policy gradient method that accommodates nonzero noise means estimated from data. Second, we analyze its convergence under constant stepsizes chosen as simple polynomials of the problem parameters, ensuring global decrease of the value function. Finally, numerical studies on mean-variance portfolio allocation and dynamic benchmark tracking with real data validate stable convergence and illuminate sensitivity trade-offs across horizon length, trading cost intensity, state penalty scale, and estimation window.
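The augmented affine idea can be illustrated with a minimal sketch, which is not the paper's algorithm. It assumes a scalar system with hypothetical parameter values, stacks the state as z = (x, 1) so that an affine policy u = K x + k becomes linear in z and a nonzero noise mean is absorbed into the augmented dynamics, and then runs constant-stepsize gradient descent using the standard finite-horizon LQR gradient formula. The constraint-embedded and Wasserstein-robust variants are not covered, and the stepsize is picked by hand rather than from the paper's polynomial bounds.

```python
import numpy as np

# Hypothetical scalar problem data (illustrative only, not from the paper).
T = 4                      # horizon length
A, B = 0.8, 0.5            # dynamics x_{t+1} = A x_t + B u_t + w_t
mu, W = 0.3, 0.05          # noise mean (e.g. a sample estimate) and variance
Q, R, QT = 1.0, 1.0, 1.0   # stage state, control, and terminal weights
x0 = 1.0                   # deterministic initial state

# Augmented state z = (x, 1): the affine policy u = K x + k becomes u = L z.
Aa = np.array([[A, mu], [0.0, 1.0]])   # noise mean absorbed into dynamics
Ba = np.array([[B], [0.0]])
Qa, QTa = np.diag([Q, 0.0]), np.diag([QT, 0.0])
Ra = np.array([[R]])
Wa = np.diag([W, 0.0])                 # centered noise acts on x only
S0 = np.outer([x0, 1.0], [x0, 1.0])    # E[z_0 z_0^T]

def cost_to_go(L):
    """Cost matrices P_0..P_T of the policy u_t = L[t] z_t."""
    P = [None] * (T + 1)
    P[T] = QTa
    for t in range(T - 1, -1, -1):
        Acl = Aa + Ba @ L[t]
        P[t] = Qa + L[t].T @ Ra @ L[t] + Acl.T @ P[t + 1] @ Acl
    return P

def second_moments(L):
    """Second moments E[z_t z_t^T] for t = 0..T-1 under policy L."""
    S = [S0]
    for t in range(T - 1):
        Acl = Aa + Ba @ L[t]
        S.append(Acl @ S[t] @ Acl.T + Wa)
    return S

def cost(L):
    """Exact expected cost of policy L."""
    P = cost_to_go(L)
    return np.trace(P[0] @ S0) + sum(np.trace(P[t + 1] @ Wa) for t in range(T))

def gradient(L):
    """Exact gradient of the cost with respect to each gain L[t]."""
    P, S = cost_to_go(L), second_moments(L)
    return np.stack([
        2.0 * ((Ra + Ba.T @ P[t + 1] @ Ba) @ L[t] + Ba.T @ P[t + 1] @ Aa) @ S[t]
        for t in range(T)
    ])

def riccati():
    """Reference optimum from the Riccati recursion on the augmented system."""
    L = np.zeros((T, 1, 2))
    P = QTa
    for t in range(T - 1, -1, -1):
        L[t] = -np.linalg.solve(Ra + Ba.T @ P @ Ba, Ba.T @ P @ Aa)
        Acl = Aa + Ba @ L[t]
        P = Qa + L[t].T @ Ra @ L[t] + Acl.T @ P @ Acl
    return L

# Constant-stepsize gradient descent from the zero policy.
L = np.zeros((T, 1, 2))
J_init = cost(L)
eta = 0.005                # hand-picked small stepsize (an assumption)
for _ in range(2000):
    L = L - eta * gradient(L)
J_gd, J_star = cost(L), cost(riccati())
print(J_init, J_gd, J_star)
```

In this sketch the first entry of each `L[t]` plays the role of the feedback gain K_t and the second entry the offset k_t. Comparing `J_gd` with `J_star` shows how far the gradient iterates still are from the Riccati benchmark for the chosen stepsize and iteration count.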
