Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network
Jijia Liu, Feng Gao, Qingmin Liao, Chao Yu, Yu Wang
TL;DR
This work tackles sample efficiency in continuous control under suboptimal data by introducing Auto-Regressive Soft Q-learning (ARSQ), which models cross-dimensional action dependencies through an auto-regressive, entropy-regularized framework. ARSQ combines coarse-to-fine action discretization with dimensional soft advantages to capture interdependencies among action dimensions while leveraging offline demonstrations during online training. Empirically, ARSQ achieves an average $1.62\times$ improvement over the state-of-the-art value-based baseline on D4RL with non-expert demonstrations, outperforms baselines on RLBench with expert demonstrations, and also shows strong results in fully offline settings. The approach extends Soft Q-learning with dimensional advantages and auto-regressive policy construction, making learning from suboptimal data more robust and suggesting further improvements via adaptive discretization and dimension grouping.
Abstract
Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods can mitigate this burden by offering relatively high sample efficiency. Some studies further enhance sample efficiency by incorporating offline demonstration data to "kick-start" training, achieving promising results in continuous control. However, they typically compute the Q-function independently for each action dimension, neglecting interdependencies among dimensions and making it harder to identify optimal actions when learning from suboptimal data, such as non-expert demonstrations and data collected online during training. To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner. First, ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks. Next, it auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks. We evaluate ARSQ on two continuous control benchmarks, RLBench and D4RL, integrating demonstration data into online training. On D4RL, which includes non-expert demonstrations, ARSQ achieves an average $1.62\times$ performance improvement over the SOTA value-based baseline. On RLBench, which incorporates expert demonstrations, ARSQ surpasses various baselines, demonstrating its effectiveness in learning from suboptimal online-collected data. The project page is at https://sites.google.com/view/ar-soft-q
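
As a rough illustration of the two ideas named in the abstract (coarse-to-fine discretization and dimension-by-dimension, auto-regressive selection), the following minimal sketch may help. It is not the ARSQ implementation: the function names, the bin and level counts, and `toy_score` are all hypothetical, the hand-written score stands in for learned dimensional advantages, and a greedy argmax replaces the entropy-regularized (soft) policy for simplicity.

```python
import numpy as np

def coarse_to_fine_select(score_fn, low, high, bins=3, levels=3):
    """Greedy coarse-to-fine discretization of one action dimension.

    score_fn(centers) returns one score per candidate bin center.
    Each level splits [low, high] into `bins` equal bins, keeps the
    best-scoring bin, and refines inside it at the next level.
    """
    for _ in range(levels):
        edges = np.linspace(low, high, bins + 1)
        centers = 0.5 * (edges[:-1] + edges[1:])
        best = int(np.argmax(score_fn(centers)))
        low, high = edges[best], edges[best + 1]
    return 0.5 * (low + high)

def autoregressive_action(score_fn, action_dim, low=-1.0, high=1.0,
                          bins=3, levels=3):
    """Pick action dimensions one at a time, conditioning on earlier picks."""
    chosen = []
    for d in range(action_dim):
        prefix = tuple(chosen)  # dimensions already decided
        a_d = coarse_to_fine_select(
            lambda centers: score_fn(prefix, d, centers),
            low, high, bins, levels,
        )
        chosen.append(a_d)
    return np.array(chosen)

def toy_score(prefix, d, centers):
    """Hypothetical stand-in for learned per-dimension advantages.

    Dimension 0 prefers 0.5; dimension 1 prefers the negation of
    whatever was chosen for dimension 0 (a cross-dimension dependency).
    """
    target = 0.5 if d == 0 else -prefix[0]
    return -(centers - target) ** 2

action = autoregressive_action(toy_score, action_dim=2)
```

With 3 bins and 3 levels, each dimension is resolved to one of 27 cells while only 9 candidates are scored per dimension. Because the second dimension is scored with the first already fixed, its choice can depend on the first, which scoring each dimension independently cannot express.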
