Implicit Bias of Policy Gradient in Linear Quadratic Control: Extrapolation to Unseen Initial States

Noam Razin; Yotam Alexander; Edo Cohen-Karlik; Raja Giryes; Amir Globerson; Nadav Cohen

TL;DR

The paper addresses how policy gradient generalizes to unseen initial states in the underdetermined finite-horizon LQR by introducing two extrapolation metrics ${\mathcal E}_{\mathrm{opt}}$ and ${\mathcal E}_{\mathrm{cost}}$. It develops a theoretical framework linking extrapolation to system-induced exploration, proves constructive results in an exploration-inducing setting (including a shift-A system) where extrapolation can be perfect as the horizon grows, and analyzes a typical random-system setting with nontrivial extrapolation in expectation and with high probability for large state dimensions. The study also shows that the implicit bias in optimal control does not simply minimize the Euclidean norm and extends insights to nonlinear dynamics and neural controllers, corroborating theory with experiments on LQR and nonlinear systems. Practically, it suggests that carefully choosing training initial states to promote exploration can substantially improve extrapolation in real-world control tasks, with implications for safety and robustness in robotics and autonomous systems.

Abstract

In modern machine learning, models can often fit training data in numerous ways, some of which perform well on unseen (test) data, while others do not. Remarkably, in such cases gradient descent frequently exhibits an implicit bias that leads to excellent performance on unseen data. This implicit bias was extensively studied in supervised learning, but is far less understood in optimal control (reinforcement learning). There, learning a controller applied to a system via gradient descent is known as policy gradient, and a question of prime importance is the extent to which a learned controller extrapolates to unseen initial states. This paper theoretically studies the implicit bias of policy gradient in terms of extrapolation to unseen initial states. Focusing on the fundamental Linear Quadratic Regulator (LQR) problem, we establish that the extent of extrapolation depends on the degree of exploration induced by the system when commencing from initial states included in training. Experiments corroborate our theory, and demonstrate its conclusions on problems beyond LQR, where systems are non-linear and controllers are neural networks. We hypothesize that real-world optimal control may be greatly improved by developing methods for informed selection of initial states to train on.
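The setting described in the abstract can be written out as follows. This is a sketch under assumptions rather than the paper's exact statement: the cost normalization, the step size symbol $\eta$, and the name ${\mathcal U}$ for an orthonormal basis of ${\mathcal S}^\perp$ are illustrative choices, and the last expression is only one way to formalize the caption of Figure 1, which says that a controller extrapolates when $\| ({\mathbf A} + {\mathbf B} {\mathbf K}) {\mathbf x} \|^2$ is small for ${\mathbf x} \in {\mathcal S}^\perp$.

```latex
% Linear system under a linear state-feedback controller K (standard finite-horizon LQR).
\begin{align*}
  {\mathbf x}_{h+1} &= {\mathbf A} {\mathbf x}_h + {\mathbf B} {\mathbf u}_h ,
  \qquad {\mathbf u}_h = {\mathbf K} {\mathbf x}_h
  \quad\Longrightarrow\quad
  {\mathbf x}_h = ({\mathbf A} + {\mathbf B} {\mathbf K})^h {\mathbf x}_0 , \\
% Training cost over the set S of initial states seen in training (normalization assumed).
  J({\mathbf K}; {\mathcal S}) &= \frac{1}{|{\mathcal S}|} \sum_{{\mathbf x}_0 \in {\mathcal S}} \sum_{h=0}^{H}
  \left( {\mathbf x}_h^\top {\mathbf Q} {\mathbf x}_h + {\mathbf u}_h^\top {\mathbf R} {\mathbf u}_h \right) , \\
% Policy gradient: gradient descent on the training cost with step size eta (assumed symbol).
  {\mathbf K}^{(t+1)} &= {\mathbf K}^{(t)} - \eta \, \nabla J({\mathbf K}^{(t)}; {\mathcal S}) , \\
% One way to score extrapolation: average one-step norm over an orthonormal basis U of S^perp.
  {\mathcal E}_{\mathrm{opt}}({\mathbf K}) &= \frac{1}{|{\mathcal U}|} \sum_{{\mathbf x} \in {\mathcal U}}
  \left\| ({\mathbf A} + {\mathbf B} {\mathbf K}) {\mathbf x} \right\|^2 .
\end{align*}
```

The problem is underdetermined when ${\mathcal S}$ does not span the state space: many controllers minimize $J(\cdot\,; {\mathcal S})$, and they can differ arbitrarily on ${\mathcal S}^\perp$.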

Paper Structure (47 sections, 25 theorems, 266 equations, 12 figures, 4 tables)

Introduction
Related Work
Preliminaries
Policy Gradient in Linear Quadratic Control
Underdetermined Linear Quadratic Control
Quantifying Extrapolation
Analysis of Implicit Bias
Intuition: Extrapolation Depends on Exploration
Extrapolation Requires Exploration
Extrapolation in Exploration-Inducing Setting
Implicit Bias in Optimal Control $\neq$ Euclidean Norm Minimization
Extrapolation in Typical Setting
Experiments
Linear Quadratic Control
Non-Linear Systems and Neural Network Controllers
...and 32 more sections

Key Result

Proposition 1

For any iteration $t \in {\mathbb N}$ of policy gradient, the following hold.

Figures (12)

Figure 1: Intuition behind our theoretical analysis: in underdetermined LQR problems (\ref{sec:prelim:underdetermined}), the extent to which a controller learned via policy gradient extrapolates to initial states unseen in training depends on the degree of exploration induced by the system when commencing from initial states that were seen in training. Illustrated are the state dynamics induced by the $t$'th iterate of policy gradient ${\mathbf K}^{(t)}$ (left), by the final policy gradient controller ${\mathbf K}_{\mathrm{pg}}$ (middle), and by the non-extrapolating controller ${\mathbf K}_{\mathrm{no\text{-}ext}}$ defined in \ref{sec:prelim:extrapolation} (right). An arbitrary controller ${\mathbf K}$ extrapolates to initial states unseen in training if $\| ({\mathbf A} + {\mathbf B} {\mathbf K}) {\mathbf x} \|^2$ is small for ${\mathbf x} \in {\mathcal{S}}^\perp$, i.e. if the dynamics induced by ${\mathbf K}$ send towards zero the states that are orthogonal to the set ${\mathcal{S}}$ of initial states seen in training (see \ref{sec:prelim:extrapolation}). Due to the structure of training cost gradients, the dynamics induced by the final policy gradient controller ${\mathbf K}_{\mathrm{pg}}$ send towards zero every state encountered in training. Accordingly, the extent to which ${\mathbf K}_{\mathrm{pg}}$ extrapolates depends on the degree of exploration, that is, the overlap of states encountered in training with directions orthogonal to ${\mathcal{S}}$. On the other hand, the controller ${\mathbf K}_{\mathrm{no\text{-}ext}}$ ensures that states in ${\mathcal{S}}$ are sent to zero (thereby minimizing the training cost), but does not handle states in ${\mathcal{S}}^\perp$. It thus does not extrapolate.
Figure 2: In underdetermined LQR problems (\ref{sec:prelim:underdetermined}), the extent to which linear controllers learned via policy gradient extrapolate to initial states unseen in training depends on the degree of exploration that the system induces from initial states that were seen in training. We evaluated LQR problems with state space dimension $D = 5$, horizon $H = 5$ (further experiments with larger $D$ and $H$ are reported in \ref{app:experiments:lqr}), and three different linear systems: (i) an “identity” system with ${\mathbf A} = {\mathbf I} \in {\mathbb R}^{D \times D}$ (analyzed in \ref{sec:analysis:no_gen}); (ii) a “shift” system with ${\mathbf A} = \sum\nolimits_{d = 1}^D {\mathbf e}_{d \% D + 1} {\mathbf e}_d^\top$ (analyzed in \ref{sec:analysis:shift}); and (iii) a random system, where the entries of ${\mathbf A}$ are sampled independently from a zero-mean Gaussian with standard deviation $1 / \sqrt{D}$ (analyzed in \ref{sec:analysis:general}). Reported are the optimality (\ref{def:opt_measure}) and cost (\ref{def:cost_measure}) measures of extrapolation, normalized by the respective quantities attained by the non-extrapolating controller ${\mathbf K}_{\mathrm{no\text{-}ext}}$ (see \ref{sec:prelim:extrapolation}). A value of one corresponds to trivial (no) extrapolation and a value of zero corresponds to perfect extrapolation. Bar heights stand for median values over $20$ runs differing in random seed, and error bars span the interquartile range ($25$'th to $75$'th percentiles). Results: In agreement with our theory, (i) no extrapolation takes place under the “identity” system, which does not induce exploration from initial states seen in training; while (ii) substantial extrapolation is achieved under the “shift” and random systems, which induce exploration. The extrapolation under the “shift” and random systems is not perfect, which is also in agreement with our theory. Note that our theory does not explain why random systems often (but not always) lead to less extrapolation than the “shift” system. Refining our analysis to explain this intricacy is an interesting direction for future work.
Figure 3: In the pendulum and quadcopter control problems (see \ref{sec:experiments:nonlinear}), training a (non-linear) neural network controller via policy gradient often leads to a solution that extrapolates to initial states unseen in training, despite the existence of non-extrapolating solutions. Left: Initial states seen in training (blue) and initial states unseen in training that are used for evaluating extrapolation (red). Middle: Final states of trajectories emanating from initial states on the left, where the trajectories are steered by a (state-feedback) controller learned via policy gradient. The controller is parameterized as a fully-connected neural network with ReLU activation. Right: Final states of trajectories emanating from initial states on the left, where the trajectories are steered by a non-extrapolating controller, i.e. a controller that minimizes the cost for initial states seen in training while performing poorly on initial states unseen in training. We obtained such a controller by modifying the training objective to encourage steering unseen initial states to a state different from the target state. Results: Since an uncontrolled pendulum or quadcopter falls downwards from a given initial state, the systems qualitatively induce exploration of states with lower height. Complying with our theory for LQR problems (\ref{sec:analysis}), policy gradient yields near-perfect extrapolation to unseen initial states lower than those used for training. In particular, the cost measure of extrapolation, normalized by that attained by the non-extrapolating controller, is near the minimal value of zero (a value of one stands for no extrapolation). Further details are in \ref{app:experiments}: \ref{table:pend_experiments_states} and \ref{table:quad_below_experiments_states} fully specify the initial and final states depicted above, and \ref{fig:pend_experiments_states_through_time} and \ref{fig:quad_below_experiments_states_through_time} present the evolution of states through time under the policy gradient and non-extrapolating controllers.
Figure 4: In underdetermined LQR problems (\ref{sec:prelim:underdetermined}), the extent to which linear controllers learned via policy gradient extrapolate to initial states unseen in training depends on the degree of exploration that the system induces from initial states that were seen in training. This figure supplements \ref{fig:lqr_experiments_main} by including results for analogous experiments over systems with a longer time horizon $H = 8$ (instead of $H = 5$). Results: The increase in time horizon improved extrapolation to unseen initial states, in accordance with the analysis of \ref{sec:analysis:shift}. A drawback of increasing the time horizon, however, is that it can lead to instabilities during training (cf. \cite{metz2021gradients}). Indeed, for state space dimension $D = 5$, we were unable to consistently train controllers when the time horizon was substantially longer than $H = 8$. Thus, techniques enabling stable training with long time horizons may be a promising tool for improving extrapolation.
Figure 5: In underdetermined LQR problems (\ref{sec:prelim:underdetermined}), the extent to which linear controllers learned via policy gradient extrapolate to initial states unseen in training depends on the degree of exploration that the system induces from initial states that were seen in training. This figure supplements \ref{fig:lqr_experiments_main} by including results for analogous experiments over systems with a larger state space dimension $D = 40$ and horizon $H = 40$ (instead of $D = H = 5$). To reduce the cost of experiments with a larger state space dimension and longer horizon, we carried out $10$ (instead of $20$) runs per system type and number of initial states seen in training.
...and 7 more figures
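The contrast between the “identity” and “shift” systems in the Figure 2 caption can be reproduced in a few lines of NumPy. The following is a minimal sketch, not the authors' code. It assumes ${\mathbf B} = {\mathbf Q} = {\mathbf I}$ and ${\mathbf R} = {\mathbf 0}$, a single training initial state ${\mathbf e}_1$, zero initialization of ${\mathbf K}$, and a hand-picked step size and iteration count. The function names are hypothetical, and the reported quantity is an unnormalized one-step measure in the spirit of the Figure 1 caption.

```python
import numpy as np

def shift_matrix(D):
    """A = sum_d e_{d % D + 1} e_d^T (1-indexed): sends e_d to e_{d+1} and e_D to e_1."""
    A = np.zeros((D, D))
    for d in range(D):
        A[(d + 1) % D, d] = 1.0
    return A

def cost_and_grad(K, A, X0, H):
    """Cost sum_{h=1..H} ||(A+K)^h x0||^2, averaged over the columns of X0, and its gradient in K.
    Assumes B = I, Q = I, R = 0; the h = 0 term is dropped since it does not depend on K."""
    M = A + K
    states = [X0]
    for _ in range(H):
        states.append(M @ states[-1])
    n = X0.shape[1]
    cost = sum(np.sum(X ** 2) for X in states[1:]) / n
    grad = np.zeros_like(K)
    lam = np.zeros_like(X0)
    for h in range(H, 0, -1):
        lam = 2.0 * states[h] + M.T @ lam   # adjoint (backward) recursion
        grad += lam @ states[h - 1].T       # contribution of the step from h-1 to h
    return cost, grad / n

def policy_gradient(A, X0, H, lr=0.005, steps=2000):
    """Plain gradient descent on the training cost, starting from K = 0 (assumed init)."""
    K = np.zeros_like(A)
    for _ in range(steps):
        _, g = cost_and_grad(K, A, X0, H)
        K -= lr * g
    return K

def opt_measure(K, A, X0):
    """Mean of ||(A+K) u||^2 over an orthonormal basis of the orthogonal complement of span(X0)."""
    U_full, s, _ = np.linalg.svd(X0, full_matrices=True)
    rank = int(np.sum(s > 1e-10))
    U_perp = U_full[:, rank:]
    return np.sum(((A + K) @ U_perp) ** 2) / U_perp.shape[1]

D, H = 5, 5
X0 = np.eye(D)[:, :1]  # train on the single initial state e_1
results = {}
for name, A in [("identity", np.eye(D)), ("shift", shift_matrix(D))]:
    K = policy_gradient(A, X0, H)
    train_cost, _ = cost_and_grad(K, A, X0, H)
    results[name] = (train_cost, opt_measure(K, A, X0))
    print(f"{name}: train cost = {train_cost:.2e}, measure on unseen states = {results[name][1]:.3f}")
```

Under the identity system every state visited in training stays along ${\mathbf e}_1$, so the gradient never touches the other columns of ${\mathbf K}$ and the measure on unseen directions stays at its initial value of one. Under the shift system the trajectory passes through every standard basis vector, so the same procedure shrinks the dynamics along unseen directions as well.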

Theorems & Definitions (62)

Definition 1
Definition 2
Proposition 1
Proof sketch (full proof in \ref{app:proofs:no_exploration_no_extrapolation})
Remark 1
Proposition 2
Proof sketch (full proof in \ref{app:proofs:shift})
Remark 2
Lemma 1
Proof sketch (full proof in \ref{app:proofs:min_norm})
...and 52 more
