The Laplacian Keyboard: Beyond the Linear Span

Siddarth Chandrasekar, Marlos C. Machado

TL;DR

The paper introduces the Laplacian Keyboard (LK), a hierarchical RL framework that leverages graph Laplacian eigenvectors as a task-agnostic, low-frequency basis for reward approximation and behavior generation. By pre-training a Laplacian encoder and a Universal Successor Feature Approximator on reward-free data, LK constructs a continuous library of options; a meta-policy then stitches these options to solve downstream tasks, achieving zero-shot optimality for rewards in the basis span and improved sample efficiency beyond it. Theoretical bounds relate reward smoothness and basis dimensionality to value-function approximation error, and empirical results on DeepMind Control tasks show LK matches strong zero-shot baselines and surpasses flat RL in sample efficiency, while approaching privileged baselines like OKB without handcrafted features. Overall, LK provides a scalable behavioral foundation that integrates representation- and behavior-based approaches, with potential for online extensions and more flexible termination strategies in large, complex environments.

Abstract

Across scientific disciplines, Laplacian eigenvectors serve as a fundamental basis for simplifying complex systems, from signal processing to quantum mechanics. In reinforcement learning (RL), these eigenvectors provide a natural basis for approximating reward functions; however, their use is typically limited to their linear span, which restricts expressivity in complex environments. We introduce the Laplacian Keyboard (LK), a hierarchical framework that goes beyond the linear span. LK constructs a task-agnostic library of options from these eigenvectors, forming a behavior basis guaranteed to contain the optimal policy for any reward within the linear span. A meta-policy learns to stitch these options dynamically, enabling efficient learning of policies outside the original linear constraints. We establish theoretical bounds on zero-shot approximation error and demonstrate empirically that LK surpasses zero-shot solutions while achieving improved sample efficiency compared to standard RL methods.
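To make the idea of a linear-span reward approximation concrete, the sketch below projects a reward onto the first $k$ Laplacian eigenvectors of a toy state graph. It is an illustration only, not the paper's pipeline: the 20-state chain, the combinatorial Laplacian $L = D - A$, the sparse goal reward, and the helper names `path_graph_laplacian` and `reconstruct` are all assumptions made here, whereas the paper learns its eigenvectors with a Laplacian encoder from reward-free data.

```python
import numpy as np

def path_graph_laplacian(n):
    """Combinatorial Laplacian L = D - A of an n-state chain (toy stand-in graph)."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return np.diag(A.sum(axis=1)) - A

def reconstruct(reward, eigvecs, k):
    """Orthogonal projection of the reward onto the span of the first k eigenvectors."""
    basis = eigvecs[:, :k]              # columns are orthonormal
    return basis @ (basis.T @ reward)

n = 20
L = path_graph_laplacian(n)
eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues returned in ascending order

reward = np.zeros(n)
reward[-1] = 1.0                        # hypothetical sparse goal reward

# Mean squared reconstruction error for every basis size k = 1..n.
errors = [
    float(np.mean((reward - reconstruct(reward, eigvecs, k)) ** 2))
    for k in range(1, n + 1)
]
```

Because the subspaces are nested, `errors` cannot increase with `k`, and it reaches zero (up to floating-point error) once the full basis is used. A sparse reward like this one is far from smooth, so a small `k` leaves a large residual, which is the situation the meta-policy is meant to address.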

Paper Structure

This paper contains 49 sections, 4 theorems, 30 equations, 17 figures, and 7 tables.

Key Result

Theorem 3.1

Let $r: \mathscr{S} \to \mathbb{R}$ be a reward function with bounded variation (a formal definition is given in the paper's appendix; intuitively, it requires the reward function to vary smoothly over the state space) and $\mathbf{r}_k$ its [...] where $\lambda_k$ is the $k$-th eigenvalue.

Figures (17)

  • Figure 1: Overview of the Laplacian Keyboard. (Left) A reward-free dataset is used to learn a representation basis and a USFA, which together induce a behavior basis: a set of directions in a latent space. For illustration, we depict these behaviors as straight-line trajectories. (Right) In a downstream navigation task, an agent (the boat) executes these options to interact with the environment. A zero-shot policy picks and executes a single option, producing a potentially suboptimal trajectory (red), whereas the LK trains a meta-policy that sequentially selects and switches between options, composing a piecewise trajectory (blue) that better matches the task.
  • Figure 2: We illustrate the Laplacian basis in the Four-Room environment, a toy domain introduced by Sutton et al. (1999). The hotter (redder) the state's color, the larger the corresponding entry; color scales are normalized independently for each subplot. (a) Selected eigenvectors of the graph Laplacian induced by a uniform random policy, with values in parentheses indicating the eigenvector graph norm (Eqn. \ref{eqn:grap-norm}). (b) A sample reward function and its reconstruction using the first $k$ eigenvectors. Values in parentheses denote the mean squared error between the original and reconstructed reward functions. Increasing the basis size improves reconstruction accuracy.
  • Figure 3: The Laplacian Keyboard Framework. During pre-training, the agent first learns graph Laplacian eigenvectors through the Laplacian Encoder, then uses these as state representations to learn a continuous library of options via a USFA. In the downstream phase, a meta-policy learns to stitch these base policies, enabling rapid and sample-efficient adaptation to new tasks. Numbers indicate the learning sequence of each module.
  • Figure 4: The LK stitches behaviors from the learned basis to approximate the optimal policy when the reward function lies outside the span of the eigenvectors. We use the same reward function $r$ from Figure 2. The first panel contains the optimal policy of the original reward function to reach the goal state. The second panel displays the optimal policy of the reconstructed reward function with $k=6$ eigenvectors ($r_6$). This policy fails to reach the goal state. The last three panels show how LK stitches three individual options, each terminated after a fixed horizon $t_{\text{term}} = 6$. Together, these options form a near-optimal policy that closely replicates the true optimal behavior through a hierarchical control loop.
  • Figure 5: Instances from the Laplacian behavior basis. The left panel shows the behavior corresponding to $\mathbf{w}_{-\mathbf{e}_1}=[-1, 0, 0, \dots]$, while the right panel shows the behavior corresponding to $\mathbf{w}_{+\mathbf{e}_1}=[+1, 0, 0, \dots]$.
  • ...and 12 more figures
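The hierarchical control loop described in the captions for Figures 1 and 4 can be sketched on a toy problem. Everything below is a hypothetical stand-in rather than the paper's implementation: the 7x7 open grid, the four straight-line options, the hand-written `greedy_meta_policy` (the paper learns its meta-policy), and the names `step` and `run_keyboard`. Only the fixed termination horizon `t_term = 6` is taken from the Figure 4 caption.

```python
GRID = 7  # hypothetical 7x7 open grid; states are (x, y) tuples

MOVES = {"left": (-1, 0), "right": (1, 0), "down": (0, -1), "up": (0, 1)}

def step(state, action):
    """Deterministic toy dynamics: move one cell, clipped at the grid border."""
    dx, dy = MOVES[action]
    x = min(max(state[0] + dx, 0), GRID - 1)
    y = min(max(state[1] + dy, 0), GRID - 1)
    return (x, y)

# Each option is a fixed policy (state -> action) that always heads one way,
# mimicking the straight-line behaviors drawn in Figure 1.
OPTIONS = {name: (lambda state, name=name: name) for name in MOVES}

def greedy_meta_policy(state, goal):
    """Hand-written stand-in for a learned meta-policy: fix x first, then y."""
    if state[0] != goal[0]:
        return "right" if goal[0] > state[0] else "left"
    return "up" if goal[1] > state[1] else "down"

def run_keyboard(start, goal, meta_policy, options, t_term=6, max_decisions=20):
    """The meta-policy picks an option, which acts for up to t_term steps."""
    state, trajectory, chosen = start, [start], []
    for _ in range(max_decisions):
        if state == goal:
            break
        key = meta_policy(state, goal)
        chosen.append(key)
        for _ in range(t_term):
            state = step(state, options[key](state))
            trajectory.append(state)
            if state == goal:
                break
    return trajectory, chosen

# Stitching two options reaches the goal; a single option does not.
stitched, chosen = run_keyboard((0, 0), (6, 6), greedy_meta_policy, OPTIONS)
single, _ = run_keyboard((0, 0), (6, 6), lambda s, g: "right", OPTIONS, max_decisions=1)
```

Here `stitched` ends at the goal after two option choices, while `single` stops at the grid edge, mirroring the contrast between the blue and red trajectories in Figure 1.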

Theorems & Definitions (6)

  • Theorem 3.1: Value Approximation Error Bound
  • Lemma 2.1: Parseval's Theorem
  • Definition 2.2: Graph Total Variation
  • Lemma 2.3: Reconstruction Bound (Zhu and Rabbat, 2012)
  • Theorem 2.4: Value Approximation Error Bound (restated)
  • proof
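
The statements of the lemmas above are not reproduced on this page. As a rough guide to how a Parseval identity and a smoothness measure combine into a reconstruction bound, the following standard argument may help. It assumes a symmetric Laplacian $L$ with orthonormal eigenvectors $\mathbf{e}_1, \dots, \mathbf{e}_n$ and eigenvalues $0 = \lambda_1 \le \dots \le \lambda_n$, and it uses the quadratic form $r^\top L r$ as the smoothness measure. The paper's definition of graph total variation, its indexing of $\lambda_k$, and its constants may differ.

```latex
\begin{align*}
  r &= \sum_{i=1}^{n} c_i \mathbf{e}_i, \qquad c_i = \langle r, \mathbf{e}_i \rangle,
    \qquad \mathbf{r}_k = \sum_{i=1}^{k} c_i \mathbf{e}_i, \\
  \lVert r \rVert_2^2 &= \sum_{i=1}^{n} c_i^2
    \quad \text{(Parseval)}, \\
  r^\top L r &= \sum_{i=1}^{n} \lambda_i c_i^2, \\
  \lVert r - \mathbf{r}_k \rVert_2^2 &= \sum_{i=k+1}^{n} c_i^2
    \;\le\; \frac{1}{\lambda_{k+1}} \sum_{i=k+1}^{n} \lambda_i c_i^2
    \;\le\; \frac{r^\top L r}{\lambda_{k+1}}
    \qquad (\lambda_{k+1} > 0).
\end{align*}
```

Under these assumptions, a smoother reward (smaller $r^\top L r$) or a larger basis (larger $\lambda_{k+1}$) gives a smaller residual, which matches the qualitative dependence on reward smoothness and basis dimensionality described in the TL;DR.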