An Invitation to Deep Reinforcement Learning

Bernhard Jaeger, Andreas Geiger

TL;DR

The paper reframes deep reinforcement learning as a generalization of supervised learning to non-differentiable objectives, aiming to lower the entry barrier for practitioners. It surveys core off-policy and on-policy approaches, detailing value-learning with Q-functions (e.g., Q-learning, SAC) and policy-gradient methods (e.g., REINFORCE, PPO) with practical examples and algorithmic insights. It emphasizes data collection challenges such as compounding errors, exploration strategies, and replay buffers, and presents robust methods (SAC, PPO) that have become standard for deep RL. The result is an accessible, optimization-focused tutorial that connects classical RL ideas to modern deep RL techniques and their broad applicability to perception, control, and generative systems.

Abstract

Training a deep neural network to maximize a target objective has become the standard recipe for successful machine learning over the last decade. These networks can be optimized with supervised learning, if the target objective is differentiable. For many interesting problems, this is however not the case. Common objectives like intersection over union (IoU), bilingual evaluation understudy (BLEU) score or rewards cannot be optimized with supervised learning. A common workaround is to define differentiable surrogate losses, leading to suboptimal solutions with respect to the actual objective. Reinforcement learning (RL) has emerged as a promising alternative for optimizing deep neural networks to maximize non-differentiable objectives in recent years. Examples include aligning large language models via human feedback, code generation, object detection or control problems. This makes RL techniques relevant to the larger machine learning audience. The subject is, however, time intensive to approach due to the large range of methods, as well as the often very theoretical presentation. In this introduction, we take an alternative approach, different from classic reinforcement learning textbooks. Rather than focusing on tabular problems, we introduce reinforcement learning as a generalization of supervised learning, which we first apply to non-differentiable objectives and later to temporal problems. Assuming only basic knowledge of supervised learning, the reader will be able to understand state-of-the-art deep RL algorithms like proximal policy optimization (PPO) after reading this tutorial.

Paper Structure

This paper contains 38 sections, 47 equations, 8 figures, 1 table, 2 algorithms.

Figures (8)

  • Figure 1: An Invitation to Deep Reinforcement Learning. This tutorial is structured as follows: We start by introducing reinforcement learning techniques through the lens of optimizing non-differentiable metrics for single-step problems in Section \ref{sec:optimization}. In particular, we discuss value learning in Section \ref{sec:value_learning} and stochastic policy gradients in Section \ref{sec:policy_gradients}. For each category of algorithms, we provide a simple example assuming a fixed labeled dataset, thereby connecting RL to supervised learning objectives. This assumption is lifted in Section \ref{sec:data_collection}, where we discuss data collection for sequential decision making problems. Next, we extend the techniques from Section \ref{sec:optimization} to sequential (multi-step) decision making problems. More specifically, we extend value learning to off-policy RL in Section \ref{sec:off_policy_learning} and stochastic policy gradients to on-policy RL in Section \ref{sec:on_policy_learning}. For both paradigms, we introduce basic learning algorithms (TD-Learning, REINFORCE), discuss common problems and solutions, and introduce a modern advanced algorithm (SAC, PPO).
  • Figure 2: Q-Functions. We illustrate the predicted reward of a Q-function for a fixed state. (\ref{fig:disc_q}) Discrete action space with 5 classes. The best action can be selected by computing all 5 Q-values (sequentially or in parallel). (\ref{fig:cont_q}) 1-dimensional continuous action space. The maximum value cannot easily be found since the Q-function can only be evaluated at a finite number of points. Instead, a policy network predicts the action with the highest reward. The policy is improved by following the gradient of the Q-function uphill.
  • Figure 3: Optimization of Non-Differentiable Objectives. We compare Q-learning (continuous setting/actor-critic) in (\ref{fig:q_learning}) to stochastic policy gradients in (\ref{fig:policy_gradient}). Note that neither Q-learning nor stochastic policy gradients require differentiation of the environment.
  • Figure 4: Compounding Error Problem. Small mistakes lead to non-IID states, which in turn increase the error.
  • Figure 5: PPO-Clip objective. The top row illustrates positive advantages, the bottom row negative advantages. The columns illustrate different probabilities of the data-collecting policy $\pi_{\beta}$. Optimization moves upwards. PPO-Clip clips the objective if $\pi$ moved too far upwards compared to $\pi_{\beta}$. Here, we use a clipping threshold $\psi = 0.2$.
  • ...and 3 more figures
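
The captions of Figures 2 and 5 describe two small computations that can be sketched in a few lines of Python: picking the best action in a discrete action space by evaluating every Q-value, and the per-sample PPO-Clip objective with clipping threshold $\psi = 0.2$. This is a minimal sketch, not the paper's implementation. The function names `greedy_action` and `ppo_clip_objective` and all numeric inputs are hypothetical, and the clipping formula is the commonly used PPO-Clip form, which is assumed here to match the one in the paper.

```python
def greedy_action(q_values):
    """Return the index of the action with the highest Q-value.

    Mirrors the discrete case of Figure 2: all Q-values for a fixed
    state are computed, and the best action is the argmax.
    """
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ppo_clip_objective(pi_new, pi_beta, advantage, psi=0.2):
    """Per-sample PPO-Clip surrogate objective (to be maximized).

    pi_new:    probability of the action under the current policy pi
    pi_beta:   probability of the same action under the data-collecting
               policy pi_beta (assumed to be > 0)
    advantage: advantage estimate for that action
    psi:       clipping threshold (0.2 in the Figure 5 caption)

    Assumption: this is the commonly used PPO-Clip form,
    min(r * A, clip(r, 1 - psi, 1 + psi) * A) with r = pi_new / pi_beta.
    """
    ratio = pi_new / pi_beta
    clipped_ratio = max(1.0 - psi, min(ratio, 1.0 + psi))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Hypothetical Q-values for a fixed state with 5 discrete actions.
    print(greedy_action([0.1, 0.7, 0.3, 0.2, 0.5]))

    # Positive advantage, pi moved far above pi_beta: objective is clipped.
    print(ppo_clip_objective(pi_new=0.9, pi_beta=0.5, advantage=1.0))

    # Negative advantage, pi moved far below pi_beta: objective is clipped.
    print(ppo_clip_objective(pi_new=0.1, pi_beta=0.5, advantage=-1.0))

    # Positive advantage, pi moved below pi_beta: no clipping applies.
    print(ppo_clip_objective(pi_new=0.1, pi_beta=0.5, advantage=1.0))
```

Once the objective is clipped, it no longer depends on `pi_new`, so a gradient-based optimizer gets no incentive to move the policy further from $\pi_{\beta}$ in that direction. Taking the minimum keeps the unclipped term whenever the policy has moved in the wrong direction, so such moves can still be corrected.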