SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

Nathan S. de Lara; Florian Shkurti

TL;DR

The paper provides evidence consistent with the hypothesis that, in the loss landscape, the offline maxima found by prior algorithms are separated from online maxima by low-performance valleys that gradient-based fine-tuning traverses.

Abstract

Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Motivated by this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and the action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
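The first-order equality mentioned in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a Boltzmann (maximum-entropy) policy pi(a) proportional to exp(Q(a) / alpha), for which the policy score equals the action-gradient of Q divided by alpha. The names `ALPHA`, `MU`, `q_value`, `log_policy`, and `score_matching_penalty` are hypothetical, and the state is omitted for brevity.

```python
import math

ALPHA = 0.5  # hypothetical temperature (assumption of this sketch)
MU = 0.3     # hypothetical mode of the toy Q-function


def q_value(action: float) -> float:
    """Toy quadratic Q-function over a 1-D action (state omitted)."""
    return -0.5 * (action - MU) ** 2


def log_policy(action: float) -> float:
    """Log-density of the Boltzmann policy pi(a) proportional to exp(Q(a) / ALPHA).

    For the quadratic Q above this is a Gaussian with mean MU and variance ALPHA.
    """
    return -0.5 * (action - MU) ** 2 / ALPHA - 0.5 * math.log(2 * math.pi * ALPHA)


def derivative(fn, x: float, eps: float = 1e-5) -> float:
    """Central finite-difference derivative of a scalar function."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


def policy_score(action: float) -> float:
    """Score of the policy: derivative of log pi(a) with respect to the action."""
    return derivative(log_policy, action)


def score_matching_penalty(q_fn, score_fn, actions) -> float:
    """Mean squared gap between grad_a Q and ALPHA * score over sample actions."""
    gaps = [(derivative(q_fn, a) - ALPHA * score_fn(a)) ** 2 for a in actions]
    return sum(gaps) / len(gaps)


actions = [-1.0, 0.0, 0.3, 1.2]

# Q is consistent with the policy, so the penalty is numerically zero.
penalty_matched = score_matching_penalty(q_value, policy_score, actions)

# A Q-function with twice the action-gradient violates the equality.
penalty_mismatched = score_matching_penalty(
    lambda a: 2.0 * q_value(a), policy_score, actions
)
```

In this toy setting, a penalty of this kind is zero exactly when the Q-function's action-gradient agrees with the scaled score, which is the sort of constraint a score-matching regularizer would push a critic toward. How SMAC estimates the dataset's score and weights the regularizer is described in the paper's method section, not here.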

Paper Structure (41 sections, 32 equations, 16 figures, 7 tables, 1 algorithm)

Introduction
Preliminaries
Problem statement
Experimental setup
Benchmarks
Offline-to-online Setup
Linear connectivity of offline and online maxima
Score Matched Actor-Critic (SMAC): regularizing Q-values with dataset scores
Estimating the dataset's score
Regularizing the $Q$-function with score matching
Using Muon as an optimizer
Experimental Results
Related works
Limitations & future work
Conclusion
...and 26 more sections

Figures (16)

Figure 1: Past offline RL methods converge to maxima separated from online optima by low-reward valleys. Top: reward landscapes on the Kitchen task for CalQL (left) and SMAC (right). The blue and checkered flags mark the actual locations of the pre-trained and fine-tuned checkpoints on the landscape, respectively. The paths and the red/yellow flag are illustrative annotations showing the hypothesized trajectory during transfer. With CalQL, the paths indicate a low-reward valley between the pre-trained and fine-tuned checkpoints. SMAC has no such valley: its pre-trained checkpoint sits on the same hill as the fine-tuned checkpoint. Bottom: SMAC vs. CalQL performance on the Kitchen task. See Section \ref{section:why_problem} for analysis.
Figure 2: Increasing dataset size and coverage does not bridge the offline-to-online gap. We generate rollouts in two environments with a policy that has a 0.7 success rate and plot the offline-to-online performance as we increase the dataset size. Even when the dataset is large enough to learn optimal policies, the actor-critics found are still quickly unlearned by online fine-tuning.
Figure 3: Reward visualized along a plane in parameter space reveals differences in the maxima found by different pre-training and fine-tuning methods on the Kitchen task. Across all baselines, the SAC maxima are wider and are not connected to the pre-trained checkpoint along a monotonically improving line. Conversely, SAC maxima and SMAC maxima are linearly connected. Subplot titles denote the offline algorithm used.
Figure 4: Reward valleys along the linear interpolation between pre-training and fine-tuning checkpoints, for all baselines, show linearly disconnected maxima, consistent with the offline-to-online transfer performance in later plots. We plot the performance along the line between the pre-trained checkpoint and the final fine-tuning checkpoint for methods in kitchen-partial (left), door-binary (centre), and hopper-medium-replay (right). $0$ is the pre-trained checkpoint, and $1$ is the SAC fine-tuned checkpoint. Lines show the mean over 4 seeds, with shading indicating standard error.
Figure 5: t-SNE projections of training trajectories show linearly disconnected maxima. We take the pre-training checkpoints, SAC fine-tuning checkpoints, and TD3+BC fine-tuning checkpoints and plot their t-SNE projections, with lines and arrows signifying the training trajectory and ordering of the checkpoints. We observe that the projected checkpoints (i) travel in straight lines, and (ii) cross a valley of low reward when fine-tuned with SAC but not when fine-tuned with TD3+BC, providing evidence consistent with the reward valley hypothesis.
...and 11 more figures
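The linear-connectivity check described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: parameters are plain dictionaries of float lists, `evaluate` is a hypothetical stand-in for rolling out a policy and returning its mean reward, and the two toy reward functions are invented to show one disconnected and one connected case.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, List[float]]


def interpolate(theta_pre: Params, theta_ft: Params, t: float) -> Params:
    """Return (1 - t) * theta_pre + t * theta_ft, parameter by parameter."""
    return {
        name: [(1.0 - t) * p + t * f for p, f in zip(theta_pre[name], theta_ft[name])]
        for name in theta_pre
    }


def reward_along_line(
    theta_pre: Params,
    theta_ft: Params,
    evaluate: Callable[[Params], float],
    num_points: int = 11,
) -> List[Tuple[float, float]]:
    """Evaluate reward at evenly spaced points; t=0 is pre-trained, t=1 is fine-tuned."""
    ts = [i / (num_points - 1) for i in range(num_points)]
    return [(t, evaluate(interpolate(theta_pre, theta_ft, t))) for t in ts]


def has_valley(curve: List[Tuple[float, float]], tol: float = 1e-9) -> bool:
    """True if an interior point scores below the lower of the two endpoints."""
    rewards = [r for _, r in curve]
    floor = min(rewards[0], rewards[-1])
    return any(r < floor - tol for r in rewards[1:-1])


# Toy reward with two separated maxima at w = -1 and w = +1 (hypothetical).
def double_peak(theta: Params) -> float:
    w = theta["w"][0]
    return -((w * w - 1.0) ** 2)


# Toy reward with a single maximum at w = 1 (hypothetical).
def single_peak(theta: Params) -> float:
    w = theta["w"][0]
    return -((w - 1.0) ** 2)


valley_curve = reward_along_line({"w": [-1.0]}, {"w": [1.0]}, double_peak)
connected_curve = reward_along_line({"w": [0.0]}, {"w": [1.0]}, single_peak)
```

Here `has_valley(valley_curve)` flags the dip between the two peaks, while `has_valley(connected_curve)` does not, mirroring the distinction the captions draw between linearly disconnected and linearly connected maxima. With real networks, `evaluate` would average returns over several rollouts and seeds.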
