Is Value Functions Estimation with Classification Plug-and-play for Offline Reinforcement Learning?

Denis Tarasov, Kirill Brilliantov, Dmitrii Kharlapenko

TL;DR

Replacing mean squared error regression with cross-entropy classification for value function estimation can outperform state-of-the-art solutions for some algorithms on certain tasks while remaining comparable on others. For other algorithms, however, the same modification can cause a dramatic performance drop.

Abstract

In deep Reinforcement Learning (RL), value functions are typically approximated using deep neural networks and trained via mean squared error regression objectives to fit the true value functions. Recent research has proposed an alternative approach that uses the cross-entropy classification objective and has demonstrated improved performance and scalability of RL algorithms. However, existing studies have not extensively benchmarked the effects of this replacement across various domains, as their primary objective was to demonstrate the efficacy of the concept across a broad spectrum of tasks rather than to provide in-depth analysis. Our work seeks to empirically investigate the impact of such a replacement in an offline RL setup and to analyze the effects of different aspects on performance. Through large-scale experiments conducted across a diverse range of tasks using different algorithms, we aim to gain deeper insights into the implications of this approach. Our results reveal that incorporating this change can lead to superior performance over state-of-the-art solutions for some algorithms in certain tasks while maintaining comparable performance levels in other tasks. For other algorithms, however, this modification might lead to a dramatic performance drop. These findings are crucial for further application of the classification approach in research and practical tasks.
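
To make the replacement described in the abstract concrete, the following is a minimal sketch of training a value head with cross-entropy instead of mean squared error. It assumes a two-hot encoding of scalar targets over evenly spaced bins. This encoding is chosen here only for illustration and is not necessarily the one used in the paper. The function names and the parameters `v_min`, `v_max`, and `num_bins` are hypothetical.

```python
import numpy as np


def two_hot(targets, v_min, v_max, num_bins):
    """Encode scalar targets as distributions over evenly spaced bin centers.

    Illustrative assumption: each target puts its mass on the two nearest
    bin centers so that the expectation recovers the (clipped) target.
    """
    centers = np.linspace(v_min, v_max, num_bins)
    targets = np.clip(np.atleast_1d(np.asarray(targets, dtype=np.float64)), v_min, v_max)
    step = (v_max - v_min) / (num_bins - 1)
    pos = (targets - v_min) / step
    # Index of the lower neighbouring bin, capped so that lower + 1 is valid.
    lower = np.minimum(np.floor(pos).astype(int), num_bins - 2)
    upper_weight = pos - lower
    probs = np.zeros((targets.shape[0], num_bins))
    rows = np.arange(targets.shape[0])
    probs[rows, lower] = 1.0 - upper_weight
    probs[rows, lower + 1] = upper_weight
    return probs, centers


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def classification_value_loss(logits, target_probs):
    """Cross-entropy between predicted bin logits and target distributions."""
    return -(target_probs * log_softmax(logits)).sum(axis=-1).mean()


def expected_value(logits, centers):
    """Recover a scalar value estimate as the mean of the predicted distribution."""
    return np.exp(log_softmax(logits)) @ centers


def mse_value_loss(predictions, targets):
    """The usual regression objective, shown for comparison."""
    return np.mean((np.asarray(predictions) - np.asarray(targets)) ** 2)
```

In this sketch the network outputs `num_bins` logits per state-action pair rather than a single scalar, and the scalar estimate needed by the rest of the algorithm is read out with `expected_value`. Targets outside `[v_min, v_max]` are clipped, so the choice of support matters in practice.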

Paper Structure (43 sections, 8 equations, 17 figures, 18 tables)

This paper contains 43 sections, 8 equations, 17 figures, 18 tables.

Figures (17)

  • Figure 1: Heatmaps for the impact of the classification parameters averaged over the domains. See \ref{app:heatmaps} and \ref{app:impact} for more results.
  • Figure 2: Dependency of the algorithms' performance on $v_{expand}$ values averaged over domains. See \ref{app:expand} for a tabular representation.
  • Figure 3: Dependency of the algorithms' performance on the number of additional layers averaged over domains. See \ref{app:scale} for a tabular representation.
  • Figure 4: Q-value function behaviour for ReBRAC and IQL on AntMaze tasks. The shaded area shows the standard deviation across ten random seeds.
  • Figure 5: rliable (Agarwal et al., 2021) metrics for ReBRAC, IQL, and LB-SAC averaged over all Gym-MuJoCo, AntMaze and Adroit datasets. Ten evaluation seeds are used for ReBRAC and IQL and four seeds for LB-SAC.
  • ...and 12 more figures