Policy Gradient for LQR with Domain Randomization

Tesshu Fujinami, Bruce D. Lee, Nikolai Matni, George J. Pappas

TL;DR

This work analyzes policy gradient (PG) learning for domain-randomized LQR (DR-LQR). It shows that PG converges globally to the minimizer of a finite-sample, sample-average (SA) DR objective $J_{SA}$, provided the heterogeneity of the sampled systems is sufficiently small. It introduces discount-factor annealing to achieve joint stabilization across multiple DR systems, with a finite-time guarantee tying progress to the smoothness of the sample-average cost. The paper quantifies how well the SA solution approximates the true DR optimum and characterizes the sample complexity needed to achieve a desired performance gap, supported by numerical and hardware experiments. Overall, the results provide a principled foundation for gradient-based DR methods in continuous control and suggest robust, practical strategies for sim-to-real transfer with LQR-like controllers.
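
To make the sample-average objective concrete, the following is a minimal sketch rather than the paper's algorithm. It assumes scalar systems $x_{t+1} = a x_t + b u_t$ with feedback $u_t = -k x_t$, initial state $x_0 = 1$, and $J_{SA}$ taken as the plain average of the per-system LQR costs. The values in `SYSTEMS`, the weights `Q` and `R`, the step size, and the backtracking safeguard are all illustrative assumptions that do not come from the source.

```python
import math

# Hypothetical sampled scalar systems x_{t+1} = a*x_t + b*u_t (illustrative values only).
SYSTEMS = [(1.2, 1.0), (1.5, 0.8), (0.9, 1.2)]
Q, R = 1.0, 1.0  # assumed stage cost Q*x^2 + R*u^2


def lqr_cost(k, a, b):
    """Infinite-horizon cost of u = -k*x from x_0 = 1 (inf if k is not stabilizing)."""
    c = a - b * k  # closed-loop pole
    if abs(c) >= 1.0:
        return math.inf
    return (Q + R * k * k) / (1.0 - c * c)


def lqr_grad(k, a, b):
    """Derivative of lqr_cost with respect to k (valid only where the cost is finite)."""
    c = a - b * k
    d = 1.0 - c * c
    return (2.0 * R * k * d - 2.0 * b * c * (Q + R * k * k)) / (d * d)


def sample_average_cost(k, systems=SYSTEMS):
    """Assumed form of J_SA: the plain average of the per-system costs."""
    return sum(lqr_cost(k, a, b) for a, b in systems) / len(systems)


def sample_average_grad(k, systems=SYSTEMS):
    return sum(lqr_grad(k, a, b) for a, b in systems) / len(systems)


def policy_gradient(k0, steps=200, step_size=0.05, systems=SYSTEMS):
    """Gradient descent on the sample-average cost from a jointly stabilizing k0."""
    if math.isinf(sample_average_cost(k0, systems)):
        raise ValueError("k0 must stabilize every sampled system")
    k = k0
    for _ in range(steps):
        g = sample_average_grad(k, systems)
        eta = step_size
        # Backtracking (an addition for this sketch): halve the step until the
        # candidate stays jointly stabilizing and does not increase the cost.
        while sample_average_cost(k - eta * g, systems) > sample_average_cost(k, systems):
            eta *= 0.5
        k -= eta * g
    return k


if __name__ == "__main__":
    k_star = policy_gradient(k0=1.0)
    print("gain:", k_star, "sample-average cost:", sample_average_cost(k_star))
```

The sketch requires an initial gain that stabilizes every sampled system at once, which is the requirement that the discount-factor annealing scheme is meant to remove.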

Abstract

Domain randomization (DR) enables sim-to-real transfer by training controllers on a distribution of simulated environments, with the goal of achieving robust performance in the real world. Although DR is widely used in practice and is often solved using simple policy gradient (PG) methods, understanding of its theoretical guarantees remains limited. Toward addressing this gap, we provide the first convergence analysis of PG methods for domain-randomized linear quadratic regulation (LQR). We show that PG converges globally to the minimizer of a finite-sample approximation of the DR objective under suitable bounds on the heterogeneity of the sampled systems. We also quantify the sample-complexity associated with achieving a small performance gap between the sample-average and population-level objectives. Additionally, we propose and analyze a discount-factor annealing algorithm that obviates the need for an initial jointly stabilizing controller, which may be challenging to find. Empirical results support our theoretical findings and highlight promising directions for future work, including risk-sensitive DR formulations and stochastic PG algorithms.
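
The idea behind discount-factor annealing can be sketched as follows, under stated assumptions and without claiming to reproduce the paper's update rule. The sketch uses scalar systems $x_{t+1} = a x_t + b u_t$ with $u_t = -k x_t$ and $x_0 = 1$, for which the discounted cost is finite whenever $\gamma (a - b k)^2 < 1$. The discount schedule in `next_discount`, the `margin` parameter, the backtracking step, and the values in `SYSTEMS` are hypothetical choices made for illustration.

```python
import math

SYSTEMS = [(1.2, 1.0), (1.5, 0.8), (0.9, 1.2)]  # hypothetical (a, b) pairs
Q, R = 1.0, 1.0  # assumed stage cost Q*x^2 + R*u^2


def avg_cost(k, gamma, systems):
    """Average discounted cost of u = -k*x from x_0 = 1; inf if any term diverges."""
    total = 0.0
    for a, b in systems:
        d = 1.0 - gamma * (a - b * k) ** 2
        if d <= 0.0:
            return math.inf
        total += (Q + R * k * k) / d
    return total / len(systems)


def avg_grad(k, gamma, systems):
    """Derivative of avg_cost in k (valid only where the cost is finite)."""
    total = 0.0
    for a, b in systems:
        c = a - b * k
        d = 1.0 - gamma * c * c
        total += (2.0 * R * k * d - 2.0 * gamma * b * c * (Q + R * k * k)) / (d * d)
    return total / len(systems)


def descend(k, gamma, systems, steps, step_size=0.05):
    """Gradient descent on the discounted average cost with step halving."""
    for _ in range(steps):
        g = avg_grad(k, gamma, systems)
        eta = step_size
        while avg_cost(k - eta * g, gamma, systems) > avg_cost(k, gamma, systems):
            eta *= 0.5
        k -= eta * g
    return k


def next_discount(k, gamma, systems, margin):
    """Heuristic schedule: largest discount keeping gamma*(a - b*k)^2 <= margin."""
    worst = max((a - b * k) ** 2 for a, b in systems)
    if worst <= margin:
        return 1.0
    return min(1.0, max(gamma, margin / worst))


def anneal(systems=SYSTEMS, k=0.0, margin=0.9, inner_steps=50, max_rounds=50):
    """Alternate descent and discount increases; gamma < 1 on return means it stalled."""
    gamma = next_discount(k, 0.0, systems, margin)
    for _ in range(max_rounds):
        k = descend(k, gamma, systems, inner_steps)
        if gamma >= 1.0:
            break
        gamma = next_discount(k, gamma, systems, margin)
    return k, gamma


if __name__ == "__main__":
    k_final, gamma_final = anneal()
    print("gain:", k_final, "final discount:", gamma_final)
```

Starting from $k = 0$, the undiscounted cost is infinite for these example systems, yet a sufficiently small discount makes every term finite, so descent can begin without a jointly stabilizing initial controller.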

Paper Structure

This paper contains 23 sections, 19 theorems, 60 equations, 6 figures, and 2 algorithms.

Key Result

Lemma III.1

Suppose that the collection $(\theta_1, \dots, \theta_M)$ is $(B,\mathbf{s})$-bounded. Let $\mathcal{K}_\zeta \triangleq \{K\in\mathcal{K} \colon J_{SA}(K)\leq\zeta\}$. It holds that $\mathcal{K}_\zeta \subseteq \bigcap_{i=1}^M\mathcal{K}_{\mathbf{s}\zeta}^i$ for any $

Figures (6)

  • Figure 1: Convergence of policy gradient with domain randomization. A controller initialized at $K=0$ converges to the optimal controller via progressive discounting and gradient descent (left), resulting in the convergence of the excess sample average cost (center). With more samples, the minimizer for the sample average cost and the domain randomization objective start to match (right).
  • Figure 2: Convergence of policy gradient with the entropic risk measure using $t=1.0$
  • Figure 3: Stochastic gradient descent applied to the linearized inverted pendulum.
  • Figure 4: Gradient descent with progressive discounting applied to the linearized rotational inverted pendulum.
  • Figure 5: Ten trials of CE stabilization. The pendulum almost reaches the upright posture (left) but fails to remain within $\pm 90$ degrees (right).
  • ...and 1 more figure

Theorems & Definitions (34)

  • Definition III.1
  • Lemma III.1: Sum of LQR Costs is Coercive
  • proof
  • Lemma III.2: Convergence to a Fixed Point
  • Lemma III.3: Approximate Gradient Domination
  • proof
  • Lemma III.4
  • proof
  • Lemma III.5
  • Lemma III.6
  • ...and 24 more