Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network

Jijia Liu, Feng Gao, Qingmin Liao, Chao Yu, Yu Wang

TL;DR

This work tackles sample efficiency in continuous control under suboptimal data by introducing Auto-Regressive Soft Q-learning (ARSQ), which models dependencies across action dimensions through an auto-regressive, entropy-regularized framework. ARSQ combines coarse-to-fine action discretization with dimensional soft advantages to capture these interdependencies while leveraging offline demonstrations during online training. Empirically, ARSQ achieves an average $1.62\times$ improvement over the state-of-the-art value-based baseline on suboptimal D4RL data and outperforms baselines on RLBench with expert demonstrations, with strong results in fully offline settings as well. The approach extends Soft Q-learning with dimensional soft advantages and auto-regressive policy construction, offering robust learning from suboptimal data; adaptive discretization and dimension grouping are suggested as further improvements.

Abstract

Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods can mitigate this burden by offering relatively high sample efficiency. Some studies further enhance sample efficiency by incorporating offline demonstration data to "kick-start" training, achieving promising results in continuous control. However, they typically compute the Q-function independently for each action dimension, neglecting interdependencies and making it harder to identify optimal actions when learning from suboptimal data, such as non-expert demonstrations and data collected online during training. To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner. First, ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks. Next, it auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks. We evaluate ARSQ on two continuous control benchmarks, RLBench and D4RL, integrating demonstration data into online training. On D4RL, which includes non-expert demonstrations, ARSQ achieves an average $1.62\times$ performance improvement over the SOTA value-based baseline. On RLBench, which incorporates expert demonstrations, ARSQ surpasses various baselines, demonstrating its effectiveness in learning from suboptimal online-collected data. The project page is at https://sites.google.com/view/ar-soft-q
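The coarse-to-fine decomposition mentioned in the abstract can be illustrated for a single action dimension. The sketch below is only one plausible reading, not the authors' implementation: the function names `encode` and `decode`, the choice of 3 levels with 3 bins each, and the use of the interval midpoint as the decoded action are all assumptions made for illustration.

```python
from typing import List, Sequence


def encode(value: float, low: float, high: float, levels: int, bins: int) -> List[int]:
    """Return the bin index picked at each level while zooming in on `value`.

    Hypothetical helper: each level splits the current interval into `bins`
    equal parts and keeps the part that contains `value`.
    """
    indices = []
    for _ in range(levels):
        width = (high - low) / bins
        # Clamp so that boundary values still map to a valid bin.
        idx = max(0, min(int((value - low) / width), bins - 1))
        indices.append(idx)
        # Right-hand side uses the old `low`, then both bounds are updated.
        low, high = low + idx * width, low + (idx + 1) * width
    return indices


def decode(indices: Sequence[int], low: float, high: float, bins: int) -> float:
    """Map a sequence of per-level bin indices back to a continuous value.

    Assumption: the decoded action is the midpoint of the final interval.
    """
    for idx in indices:
        width = (high - low) / bins
        low, high = low + idx * width, low + (idx + 1) * width
    return 0.5 * (low + high)


if __name__ == "__main__":
    target = 0.3
    path = encode(target, low=-1.0, high=1.0, levels=3, bins=3)
    approx = decode(path, low=-1.0, high=1.0, bins=3)
    print("per-level bins:", path)
    print("decoded action:", approx)
```

With `levels` levels of `bins` bins each, the agent makes `levels` small discrete choices per dimension instead of one choice among `bins ** levels` fine bins, while the reconstruction error stays within half the width of the finest interval.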

Paper Structure

This paper contains 47 sections, 1 theorem, 22 equations, 20 figures, 4 tables, 2 algorithms.

Key Result

Theorem 4.3

If the dimensional soft advantage $A^d(\mathbf{s}, \mathbf{a}^{-d}, a^d)$ satisfies the advantage constraint for every dimension $d$, then the soft advantage can be expressed as the sum of the dimensional soft advantages.
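The condition in the theorem statement is not reproduced in this listing, so the following numerical check rests on an assumption: that each dimensional soft advantage is normalized so that $\alpha \log \sum_{a^d} \exp(A^d / \alpha) = 0$, and that $\mathbf{a}^{-d}$ denotes the bins already chosen for earlier dimensions. Under that assumption, $\exp(A^d / \alpha)$ is a valid conditional distribution, and the sum of dimensional advantages is itself a normalized soft advantage for the joint action. The table of random scores stands in for a learned advantage network; every name and constant here is hypothetical.

```python
import itertools
import math
import random

ALPHA = 0.5  # entropy temperature (hypothetical value)
BINS = 3     # discrete bins per action dimension (hypothetical)
DIMS = 3     # number of action dimensions (hypothetical)


def soft_normalize(scores, alpha):
    """Shift scores so that alpha * log(sum(exp(score / alpha))) equals 0."""
    m = max(scores)  # subtract the max for numerical stability
    lse = m + alpha * math.log(sum(math.exp((x - m) / alpha) for x in scores))
    return [x - lse for x in scores]


def build_table(seed=0):
    """Random stand-in for an advantage network.

    Maps each prefix of already chosen bins to normalized dimensional
    advantages over the next dimension's bins.
    """
    rng = random.Random(seed)
    table = {}
    for d in range(DIMS):
        for prefix in itertools.product(range(BINS), repeat=d):
            raw = [rng.uniform(-1.0, 1.0) for _ in range(BINS)]
            table[prefix] = soft_normalize(raw, ALPHA)
    return table


def joint_advantage(table, action):
    """Sum the dimensional advantages along one complete action tuple."""
    return sum(table[action[:d]][action[d]] for d in range(len(action)))


def joint_soft_value(table):
    """alpha * log-sum-exp of the summed advantages over all joint actions."""
    total = sum(
        math.exp(joint_advantage(table, a) / ALPHA)
        for a in itertools.product(range(BINS), repeat=DIMS)
    )
    return ALPHA * math.log(total)


def sample_action(table, rng):
    """Draw one action dimension by dimension from exp(A^d / alpha)."""
    action = ()
    for _ in range(DIMS):
        weights = [math.exp(x / ALPHA) for x in table[action]]
        action += (rng.choices(range(BINS), weights=weights)[0],)
    return action


if __name__ == "__main__":
    table = build_table()
    print("joint normalization residual:", joint_soft_value(table))
    print("sampled action:", sample_action(table, random.Random(1)))
```

The residual printed by `joint_soft_value` should be zero up to floating-point error, which is the sense in which the summed dimensional advantages behave like a single soft advantage over the joint action. Sampling needs only `DIMS` passes over `BINS` options rather than an enumeration of all `BINS ** DIMS` joint actions.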

Figures (20)

  • Figure 1: A motivating example of how Q decomposition influences policy training, as detailed in the appendix.
  • Figure 2: The ARSQ algorithm. The action space is discretized using a coarse-to-fine approach. By predicting dimensional soft advantages, ARSQ generates actions in an auto-regressive manner within a single decision-making step.
  • Figure 3: Network architecture of ARSQ. The soft value $V_{\text{soft}}$ and the dimensional soft advantage $A^d$ are predicted by two separate networks. The advantage network utilizes a shared backbone, and advantage constraints are applied to its output.
  • Figure 4: D4RL main results. mr, m, and me represent medium-replay, medium, and medium-expert, respectively.
  • Figure 5: D4RL results across demonstration quality levels, averaged over 3 tasks with 3 datasets each. We report the normalized return provided by D4RL.
  • ...and 15 more figures

Theorems & Definitions (4)

  • Definition 4.1: Soft Advantage
  • Definition 4.2: Dimensional Soft Advantage
  • Theorem 4.3
  • proof