Behavior-Adaptive Q-Learning: A Unifying Framework for Offline-to-Online RL

Lipeng Zu, Hansong Zhou, Xiaonan Zhang

TL;DR

BAQ addresses the critical challenge of stable offline-to-online RL by injecting an implicit behavioral cloning model as a behavior reference during online fine-tuning. It introduces a weighted $Q$-learning loss and a BC-divergence prioritized replay mechanism that adaptively downweights OOD updates and prioritizes informative transitions, respectively. The approach yields robust improvements over baselines across standard MuJoCo/D4RL benchmarks, especially in early online stages, and demonstrates stronger stability during the transition from offline to online policy deployment. This framework offers a practical path toward reliable real-world policy adaptation with reduced bootstrap error and improved robustness to distributional shift.

Abstract

Offline reinforcement learning (RL) enables training from fixed data without online interaction, but policies learned offline often struggle when deployed in dynamic environments due to distributional shift and unreliable value estimates on unseen state-action pairs. We introduce Behavior-Adaptive Q-Learning (BAQ), a framework designed to enable a smooth and reliable transition from offline to online RL. The key idea is to leverage an implicit behavioral model derived from offline data to provide a behavior-consistency signal during online fine-tuning. BAQ incorporates a dual-objective loss that (i) aligns the online policy toward the offline behavior when uncertainty is high, and (ii) gradually relaxes this constraint as more confident online experience is accumulated. This adaptive mechanism reduces error propagation from out-of-distribution estimates, stabilizes early online updates, and accelerates adaptation to new scenarios. Across standard benchmarks, BAQ consistently outperforms prior offline-to-online RL approaches, achieving faster recovery, improved robustness, and higher overall performance. Our results demonstrate that implicit behavior adaptation is a principled and practical solution for reliable real-world policy deployment.
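
To make the weighted $Q$-learning loss mentioned in the TL;DR concrete, the following is a minimal sketch and not the authors' implementation. It assumes a deterministic BC model that returns a reference action per state, a squared-distance divergence between the taken action and that reference, and an exponential weight; the function names and the `temp_q` parameter are hypothetical and do not come from the paper.

```python
import numpy as np


def bc_divergence(actions, bc_actions):
    """Squared L2 distance between taken actions and BC reference actions.

    Assumption: the BC model is deterministic and yields one reference
    action per state. Both arrays have shape (batch, action_dim).
    """
    return np.sum((actions - bc_actions) ** 2, axis=-1)


def q_loss_weights(divergence, temp_q=1.0):
    """Map divergence to a weight in (0, 1].

    Assumption: an exponential form, so transitions far from the
    behavior reference (likely OOD) contribute less to the update.
    `temp_q` is a hypothetical sharpness parameter.
    """
    return np.exp(-temp_q * divergence)


def weighted_td_loss(q_pred, td_target, weights):
    """Per-sample weighted mean squared TD error."""
    return np.mean(weights * (q_pred - td_target) ** 2)


# Toy batch: the first action matches the BC reference, the second does not.
actions = np.array([[0.0, 0.0], [1.0, 1.0]])
bc_actions = np.array([[0.0, 0.0], [0.0, 0.0]])
q_pred = np.array([1.0, 1.0])
td_target = np.array([0.0, 0.0])

divergence = bc_divergence(actions, bc_actions)
weights = q_loss_weights(divergence)
loss = weighted_td_loss(q_pred, td_target, weights)
```

In this toy batch both transitions have the same TD error, but the second one deviates from the behavior reference and therefore receives a smaller weight, so it moves the $Q$-function less than the in-distribution transition does.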

Paper Structure

This paper contains 21 sections, 15 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison between the actions generated by the model and those in the offline dataset. Panels (a, b) show results from the offline-trained model; panels (c, d) show results from the BC model trained on the offline dataset.
  • Figure 2: Comparison of the training processes of IQL, SO2, SUF, and our BAQ across various tasks.
  • Figure 3: Ablation results showing the performance drop when key components are removed.
  • Figure 4: Heatmaps showing the relationship between $k_{\rho}$ and $k_q$ in CQL + Ours settings.
  • Figure 5: Heatmaps showing the relationship between $k_{\rho}$ and $k_q$ in IQL + Ours settings.