Dynamic Action Interpolation: A Universal Approach for Accelerating Reinforcement Learning with Expert Guidance

Wenjun Cao

TL;DR

Reinforcement learning often suffers from severe sample inefficiency, especially in early training. Dynamic Action Interpolation (DAI) is a simple, universal action-execution mechanism that linearly blends expert and RL actions with a time-varying weight $\alpha(t)$; it requires only a few lines of code and no changes to loss terms or networks. The authors provide a theoretical framework showing how DAI reshapes state visitation distributions while preserving asymptotic convergence, and validate the approach on four MuJoCo tasks, where early-stage performance improves by over $160\%$ on average and final performance by over $50\%$ (on Humanoid, a $4\times$ gain early in training and a $2\times$ gain at convergence). The results suggest that a simple execution-level form of guidance can outperform more complex ways of integrating prior knowledge, and that it applies broadly across Actor-Critic methods.

Abstract

Reinforcement learning (RL) suffers from severe sample inefficiency, especially during early training, requiring extensive environment interaction before it performs competently. Existing methods tend to address this by incorporating prior knowledge, but they introduce significant architectural and implementation complexity. We propose Dynamic Action Interpolation (DAI), a universal yet straightforward framework that interpolates expert and RL actions via a time-varying weight $\alpha(t)$, integrating into any Actor-Critic algorithm with just a few lines of code and without auxiliary networks or additional losses. Our theoretical analysis shows that DAI reshapes state visitation distributions to accelerate value function learning while preserving convergence guarantees. Empirical evaluations across MuJoCo continuous control tasks demonstrate that DAI improves early-stage performance by over $160\%$ on average and final performance by more than $50\%$, with the Humanoid task showing a $4\times$ improvement early on and a $2\times$ gain at convergence. These results challenge the assumption that complex architectural modifications are necessary for sample-efficient reinforcement learning.
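The interpolation described in the abstract can be illustrated with a minimal sketch. The abstract does not state the form of $\alpha(t)$ or which action it weights, so the code below assumes that $\alpha(t)$ is the weight on the RL action and that it rises linearly from 0 to 1 over a hypothetical horizon `t_end`. The names `alpha_linear`, `interpolate_action`, and `t_end` are illustrative and not taken from the paper.

```python
def alpha_linear(t, t_end):
    """Assumed schedule: weight on the RL action, rising linearly from 0 to 1.

    The linear form and the horizon `t_end` are assumptions of this sketch.
    """
    if t_end <= 0:
        return 1.0
    return min(1.0, max(0.0, t / t_end))


def interpolate_action(expert_action, rl_action, alpha):
    """Blend two action vectors element-wise: (1 - alpha) * expert + alpha * RL."""
    if len(expert_action) != len(rl_action):
        raise ValueError("expert and RL actions must have the same dimension")
    return [(1.0 - alpha) * e + alpha * r
            for e, r in zip(expert_action, rl_action)]


# Illustrative use: the executed action moves from the expert's to the agent's.
expert_action = [1.0, -1.0]   # stand-in for an expert policy's output
rl_action = [0.0, 1.0]        # stand-in for the actor network's output
for step in (0, 250_000, 500_000, 1_000_000):
    a = alpha_linear(step, t_end=500_000)
    print(step, a, interpolate_action(expert_action, rl_action, a))
```

Under these assumptions the expert controls the executed action at the start of training, and the learned policy takes over completely once `t` reaches `t_end`. Only the executed action changes; the actor and critic updates of the underlying algorithm are left as they are.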

Paper Structure

This paper contains 26 sections, 12 equations, 1 figure, 2 tables, 1 algorithm.

Figures (1)

  • Figure 1: Performance comparison of TD3-DAI against TD3 and expert policies across four MuJoCo environments. TD3-DAI significantly outperforms both baselines in all environments. In the early training phase, TD3-DAI substantially improves over TD3, with the most pronounced gains in the Humanoid environment. This advantage persists throughout the training process. Across all environments, TD3-DAI ultimately surpasses TD3 and expert policies by a considerable margin. Data points represent median performance across 6 independent runs with outliers removed using the IQR method. Shaded regions indicate 95% confidence intervals computed via bootstrapping. Orange horizontal lines represent expert policy performance benchmarks.
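The aggregation described in the Figure 1 caption (median over runs, IQR outlier removal, bootstrapped 95% intervals) can be sketched as follows. The caption gives no parameters, so this sketch assumes the common 1.5 × IQR fence and a percentile bootstrap of the median. The function names and the sample returns are hypothetical and do not come from the paper.

```python
import random
import statistics


def remove_iqr_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumption."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def bootstrap_median_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the median (assumed variant)."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lower = medians[int(tail * n_resamples)]
    upper = medians[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical returns from 6 runs at one evaluation point (not the paper's data).
returns = [3100.0, 3250.0, 3180.0, 3300.0, 3220.0, 900.0]
kept = remove_iqr_outliers(returns)
print("kept runs:", kept)
print("median:", statistics.median(kept))
print("95% CI:", bootstrap_median_ci(kept))
```

With only six runs per point, the bootstrap draws from very few distinct values, so the resulting interval is coarse; this is a property of the sketch's small sample, not a statement about the paper's results.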