Combining Reinforcement Learning and Tensor Networks, with an Application to Dynamical Large Deviations

Edward Gillman, Dominic C. Rose, Juan P. Garrahan

TL;DR

The paper addresses the computation of dynamical large-deviation (LD) statistics of trajectory observables in stochastic many-body systems with exponentially large state spaces by combining reinforcement learning (RL) with tensor networks. It introduces ACTeN, which uses a translation-invariant matrix product state (MPS) for the state-value function $v_{\psi}(S)$ and a matrix product operator (MPO) for the policy $\pi_{w}(a|S)$ to enable scalable actor-critic learning in 1D systems. Applied to the East model and the ASEP, ACTeN reproduces the scaled cumulant generating function (SCGF) $\theta(\lambda)$ obtained by other methods (the density matrix renormalisation group, DMRG, for the East model; exact diagonalisation, ED, for small ASEP systems) and provides access to the optimal dynamics for sampling rare trajectories at system sizes up to $L=50$, which is beyond the reach of ED. The approach demonstrates that tensor-network representations can be integrated effectively with RL to tackle both equilibrium and non-equilibrium dynamical LD problems, with broad potential for extension to other multi-agent and physics-informed RL tasks.
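To make the role of the translation-invariant MPS concrete, here is a minimal sketch of how a value for a configuration $S$ on a ring could be read off from a single shared tensor. It assumes the simplest parametrisation, $v(S) = \mathrm{Tr}\prod_i A^{s_i}$; the paper's actual parametrisation of $v_{\psi}(S)$ may differ, and the names `mps_value`, `A` and `chi` are illustrative only.

```python
import numpy as np

def mps_value(A, state):
    """Contract a translation-invariant MPS with periodic boundaries.

    A has shape (d, chi, chi): one chi x chi matrix per local state.
    Assumed form (not taken from the paper): v(S) = Tr(A[s_1] ... A[s_L]).
    """
    M = np.eye(A.shape[1])
    for s in state:
        M = M @ A[s]  # absorb one site at a time, cost O(chi^3) per site
    return float(np.trace(M))

rng = np.random.default_rng(0)
chi, L = 4, 50
# Positive entries keep the product well conditioned for this demonstration.
A = rng.uniform(0.0, 0.5, size=(2, chi, chi))
S = rng.integers(0, 2, size=L)

v = mps_value(A, S)
v_shifted = mps_value(A, np.roll(S, 1))  # same configuration, shifted by one site
```

Because the trace is cyclic, `v` and `v_shifted` agree up to rounding, so the approximator respects translation invariance by construction. The cost grows linearly in $L$ for fixed bond dimension, whereas a tabular value function would need $2^{L}$ entries.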

Abstract

We present a framework to integrate tensor network (TN) methods with reinforcement learning (RL) for solving dynamical optimisation tasks. We consider the RL actor-critic method, a model-free approach for solving RL problems, and introduce TNs as the approximators for its policy and value functions. Our "actor-critic with tensor networks" (ACTeN) method is especially well suited to problems with large and factorisable state and action spaces. As an illustration of the applicability of ACTeN we solve the exponentially hard task of sampling rare trajectories in two paradigmatic stochastic models, the East model of glasses and the asymmetric simple exclusion process (ASEP), the latter being particularly challenging to other methods due to the absence of detailed balance. With substantial potential for further integration with the vast array of existing RL methods, the approach introduced here is promising both for applications in physics and to multi-agent RL problems more generally.

Paper Structure (5 sections, 32 equations, 4 figures)

Figures (4)

  • Figure 1: Actor-Critic with tensor networks (ACTeN) (a) Sketch of a Markov decision process. (b) In actor-critic RL, the state is passed to an "actor", which chooses the action, and to a "critic", which values the state given the reward. This value is used to improve the actor's policy. In ACTeN, the function approximators for actor and critic are tensor networks. (c) Top: typical trajectory of the ASEP at half-filling and $L=50$ sites with one particle highlighted (blue), shown for $3000$ steps. Bottom: trajectory with a current large deviation, sampled from the ACTeN solution for biasing (counting) field $\lambda=-3$. See the text for details.
  • Figure 2: Dynamical large deviations in the East model using ACTeN. Scaled-cumulant generating function for the dynamical activity of the East model as a function of biasing field $\lambda$ from ACTeN (symbols), for $L=50$ and PBC. Our RL results coincide with those obtained from the current state-of-the-art method using DMRG, cf. Ref. Banuls2019 (which is possible since the East model obeys detailed balance). Inset: Kinetic constraint of the East model; a spin, $s_{i}$, can flip, $s_{i} \to 1 - s_{i}$, only if the spin to the left is up, $s_{i-1} = 1$.
  • Figure 3: Dynamical large deviations in the ASEP using ACTeN. (a) In the ASEP particles can only move to an unoccupied neighbouring site, with probability $p$ to the left and $q=1-p$ to the right. (b) SCGF for the time-integrated particle current as a function of biasing field. We show results from ACTeN for $p=0.1$ (squares) and $p=1/2$ (diamonds). The lack of detailed balance for PBC and $p \neq 1/2$ prevents straightforward application of DMRG, but for small sizes (here $L=14$) we can compare to exact diagonalisation (blue curve for $p=0.1$, green for $p=1/2$). (c) SCGF for $p=0.1$ from ACTeN for size $L=50$, which is beyond the scope of ED. Compared to $L=14$ (blue curve from ED), we see that ACTeN captures the flattening of the SCGF for larger sizes indicative of a LD phase transition, cf. Ref. Jack2015a. The inset shows the smooth convergence of our ACTeN numerics with $L$ for two values of $\lambda$. (d) Since ACTeN provides direct access to the optimal dynamics, observables such as the time-integrated current can be evaluated directly (black squares for $L=50$). We show for comparison the numerical differentiation of the ACTeN SCGF (red circles) and of the ED SCGF at $L=14$ (blue line).
  • Figure 4: Training Procedure and Learning Curves (ASEP). (a) For each bias [we show $\lambda = -1$ (top row), $\lambda =1$ (middle row), $\lambda =2$ (bottom row)] TN-based policies and value-functions are produced via actor-critic optimization. These are initialised at random for $L=4$ with $\chi=16$ and trained for $10^{6}$ steps. Every $5000$ training steps the average reward of the policy is evaluated over $10^{4}$ steps (black squares) and the weights of the policy (which we call a "snapshot" for that time) are stored. The evaluated values can be compared to the training estimate of $r(\pi)$ (red circles), which tends to overestimate $r(\pi)$ initially. The policy snapshot with the highest evaluated $r$ (blue dashed line) is used to initialise the policy for higher values of $L$. This is repeated every $\Delta L = 2$ up to $L=50$, with $L=14$ shown here. (b) For each bias, several policies (here six) are independently trained via the same procedure from different random initial conditions. This produces a distribution of evaluated average rewards, here represented by the median (black squares) and inter-quartile range (red-shaded region). The policy with the maximum average reward at each $L$ is selected as the optimal dynamics (blue triangles). (c) Same as (a) for $L=50$. The learning curves appear noisier than in (a) but note that the vertical scale is much smaller. The learning rate is kept fixed throughout. (d) The distribution of $r$ across parallel agents for $L=50$ is again much tighter than for $L=14$.
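
The exact-diagonalisation baseline mentioned in the Figure 3 caption can be illustrated on a very small ring. The sketch below is not the paper's code and rests on two assumptions that should be checked against the paper: (i) a discrete-time update in which one of the $L$ bonds is chosen uniformly at random and a particle across it hops right with probability $q$ or left with probability $p$ if the target site is empty, and (ii) the sign convention in which the transition matrix is tilted by $e^{-\lambda}$ per right hop and $e^{+\lambda}$ per left hop. The SCGF is then the logarithm of the largest eigenvalue of the tilted matrix. The function names are hypothetical.

```python
import itertools
import numpy as np

def asep_tilted_matrix(L, N, p, lam):
    """Tilted transition matrix T[new, old] for N particles on a ring of L >= 3 sites."""
    states = [s for s in itertools.product((0, 1), repeat=L) if sum(s) == N]
    index = {s: k for k, s in enumerate(states)}
    q = 1.0 - p
    T = np.zeros((len(states), len(states)))
    for s in states:
        k = index[s]
        stay = 1.0  # probability that nothing moves in this step
        for i in range(L):
            j = (i + 1) % L
            if s[i] == 1 and s[j] == 0:      # particle at i hops right
                prob, weight = q / L, np.exp(-lam)
            elif s[i] == 0 and s[j] == 1:    # particle at j hops left
                prob, weight = p / L, np.exp(lam)
            else:
                continue                      # bond is blocked or empty
            t = list(s)
            t[i], t[j] = s[j], s[i]
            T[index[tuple(t)], k] += prob * weight
            stay -= prob
        T[k, k] += stay  # no current is counted when the state is unchanged
    return T

def scgf(L, N, p, lam):
    """SCGF per step: log of the largest (Perron) eigenvalue of the tilted matrix."""
    T = asep_tilted_matrix(L, N, p, lam)
    return float(np.log(np.max(np.linalg.eigvals(T).real)))

# Half-filled ring with L = 6 sites (20 configurations).
lambdas = np.linspace(-2.0, 2.0, 9)
theta = [scgf(6, 3, 0.1, lam) for lam in lambdas]
```

At $\lambda = 0$ the matrix is stochastic, so `scgf` returns a value that is zero up to rounding, and for $p = 1/2$ the result is symmetric under $\lambda \to -\lambda$. The number of configurations grows combinatorially with $L$, which is why this kind of check is limited to small sizes such as the $L=14$ used in the figure.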