NFQ2.0: The CartPole Benchmark Revisited

Sascha Lange, Roland Hafner, Martin Riedmiller

TL;DR

NFQ2.0 revisits neural fitted Q-iteration (NFQ), showing that a modern batch-learning variant can compete with contemporary Deep RL methods on a real-world CartPole system. By adopting larger networks, continuous single-network training, stacking, and offline/online hybrid strategies (including hindsight relabeling and offline bootstrapping), the approach achieves stable, repeatable learning with relatively small data requirements. The work provides detailed ablations and offline-online comparisons that illustrate how cost shaping, action encoding, and training regimes influence learning speed and robustness, and it demonstrates practical techniques for transferring RL to industrial contexts. Collectively, NFQ2.0 offers a practical, open-source, and transferable framework for applying deep RL in real systems, with clear guidance on parameter choices and offline strategies to reduce cycle time and risk.

Abstract

This article revisits the 20-year-old neural fitted Q-iteration (NFQ) algorithm on its classical CartPole benchmark. NFQ was a pioneering approach towards modern Deep Reinforcement Learning (Deep RL) in applying multi-layer neural networks to reinforcement learning for real-world control problems. We explore the algorithm's conceptual simplicity and its transition from online to batch learning, which contributed to its stability. Despite its initial success, NFQ required extensive tuning and was not easily reproducible on real-world control problems. We propose a modernized variant, NFQ2.0, and apply it to the CartPole task, concentrating on a real-world system built from standard industrial components, to investigate and improve the repeatability and robustness of the learning process. Through ablation studies, we highlight key design decisions and hyperparameters that enhance the performance and stability of NFQ2.0 over the original variant. Finally, we demonstrate how our findings can assist practitioners in reproducing and improving results and in applying deep reinforcement learning more effectively in industrial contexts.
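
The batch-learning idea mentioned in the abstract can be made concrete with a minimal sketch of how regression targets are formed in one fitted Q-iteration step. This is the generic, cost-minimizing form of the update and not the authors' implementation: the transition layout, the discount factor `gamma = 0.98`, the treatment of terminal transitions, and the toy data are all assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
# Assumed transition layout: (state, action, cost, next_state, terminal)
Transition = Tuple[State, int, float, State, bool]


def fitted_q_targets(
    batch: Sequence[Transition],
    q: Callable[[State, int], float],
    actions: Sequence[int],
    gamma: float = 0.98,  # assumed discount factor, for illustration only
) -> List[float]:
    """Compute regression targets for one fitted Q-iteration step.

    Costs are minimized, so the bootstrap term is the minimum of the
    current Q-estimate over all actions in the next state.
    """
    targets = []
    for _state, _action, cost, next_state, terminal in batch:
        # Assumption: terminal transitions have no future cost to bootstrap from.
        future = 0.0 if terminal else min(q(next_state, b) for b in actions)
        targets.append(cost + gamma * future)
    return targets


# Toy usage with hypothetical transitions and an all-zero initial Q-function.
batch = [
    ((0.0, 0.0), 0, 0.01, (0.1, 0.0), False),
    ((0.1, 0.0), 1, 1.00, (0.2, 0.0), True),
]
targets = fitted_q_targets(batch, q=lambda s, a: 0.0, actions=[0, 1])
```

In a full training loop, a network would be fitted to these targets on the whole batch by supervised regression, and the targets would then be recomputed from the updated network.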

Paper Structure

This paper contains 34 sections, 5 equations, 21 figures, 8 tables, 1 algorithm.

Figures (21)

  • Figure 1: Schematic of the CartPole system used for the evaluation of NFQ2.0. The movement of the cart can be influenced by an action $a$ that is either a force applied to the cart (simulation) or a target speed setpoint for the cart (real system). A pole is attached to the cart and can swing freely around the pivot point. The task is to swing up the pole from a downward-hanging position and balance it indefinitely while keeping the cart in the center area of the track. The cart's position $x$, its velocity $\Delta x$, the pole angle $\alpha$ and its angular velocity $\Delta \alpha$ are observed; on the real system this is done with the help of two encoders, one mounted on the pivot point and one attached to the band pulling the cart. The pole angle $\alpha$ in this setup has a discontinuity, "jumping" from $-\pi$ to $+\pi$ when moving through the downward position. The angle is often replaced by the pair $(\sin(\alpha), \cos(\alpha))$ in the state representation in order to avoid this discontinuity and obtain a continuous representation.
  • Figure 2: The real CartPole system used for the evaluation of NFQ2.0. The system is a commercial product provided by a third-party vendor. It is built from unmodified, off-the-shelf, standard industrial automation components that are widely used in real mass-production plants. It therefore provides the robustness and durability needed for extensive testing and evaluation of the learned controllers without human oversight. Furthermore, it resembles systems found in real plants, with both their advantages (standardization, durability) and their drawbacks (limits and inflexibility due to black-box and standardized components, communication bloat, latency build-up), and is therefore a very good starting point for developing learning-based methods and controllers for use in real plants. The CartPole system consists of a linear actuator driving a cart on a linear rail with the help of a band and a pulley system. A servo motor is used to drive the band and the cart. One encoder is mounted on the pivot point and one is attached to the band pulling the cart. One physical switch mounted in front of the motor is used to calibrate the zero position of the cart. The encoders are directly connected to a Programmable Logic Controller (PLC), a Siemens S1200, via a High Frequency (HF) input module. The actuator is driven by a specialized motor controller, a Siemens V90, that is connected to the PLC via ProfiNet, an Ethernet-based industrial protocol and fieldbus. An industrial computer in the IPC form factor (not depicted), also provided by the vendor, runs the learning algorithm and the neural control algorithm in real time. We chose the Intel-based option without any specialized components (no TPU / GPU), because we did not expect any significant benefit from running the very small models on a GPU, but did expect drawbacks in CPU and bus performance from the alternative NVIDIA Jetson-based offering. The system runs an Ubuntu LTS release. It comes equipped with a specialized interface card offered by Hilscher and a small C++ component providing an easy-to-use ZMQ API for user programs to communicate with the CartPole system via ProfiNet. The PLC has been programmed in Siemens's proprietary programming language, STEP 7, via their software "TIA Portal". We use a vendor-provided minimal program that implements basic end-stop logic (left side with the switch, right side virtual, depending on the encoder reading) to protect the system from damage.
  • Figure 3: The full line of communication between the control computer and the CartPole system. Each box represents a physical component of the system that communicates with the Programmable Logic Controller (PLC) as the central component via ProfiNet, an Ethernet-based fieldbus standard. Within the control computer (IPC), several hardware (Hilscher ProfiNet Adapter) and software components (busy, Python user program) participate in the communication. The PLC, the motor controller, busy and the psipy-based user program each establish an internal control loop, reading and writing data. Our user program syncs its internal control loop to busy's control loop, which is responsible for establishing a jitter-free control loop: the sensor information is provided to the user program at the start of the cycle and the next command is sent at its end, preferring a known, constant and jitter-free interval over the lowest possible latency (which would be obtained by sending the command immediately after it becomes available). The other components cannot be assumed to synchronize with each other and run internally at different frequencies of 50 Hz or more. When running busy and NFQ at 20 Hz, the overall latency is about 2 cycles, around 100 ms, measured as the time needed for a given set-point change (edge) to have any effect on the received encoder measurements. Internal, unobserved states of the motor controller, such as a built-up integral component, have been observed to cause even later effects after a given user command (set-point change).
  • Figure 4: Performance of NFQ2.0 on the CartPole task. Plotted is the average cost per step in a single distinct evaluation episode (greedy evaluation, data not used in training) after n episodes of training. All runs learn a good policy within 120 episodes that manages to swing up and stabilize the pole reliably. Learning curves show consistent improvement towards the best learned policies with only minor stochastic regressions and surprisingly little variance among the individual runs. Thick line: average over 5 independent training runs. Thin lines: individual runs. Blue area: standard deviation.
  • Figure 5: Evaluation of the policy with the lowest average cost in run 5 after 74 episodes. The pole is in the upright position and the cart is at the center of the track after about 60 steps.
  • ...and 16 more figures
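
The $(\sin(\alpha), \cos(\alpha))$ encoding described in the caption of Figure 1 can be illustrated with a short sketch. The function names and the ordering of the state vector are hypothetical and chosen only for this example; the paper's actual state layout may differ.

```python
import math
from typing import Tuple


def encode_angle(alpha: float) -> Tuple[float, float]:
    """Map an angle in [-pi, pi] to a point on the unit circle."""
    return math.sin(alpha), math.cos(alpha)


def build_state(x: float, dx: float, alpha: float, dalpha: float) -> Tuple[float, ...]:
    """Assemble a state vector (hypothetical ordering) with the encoded angle."""
    sin_a, cos_a = encode_angle(alpha)
    return (x, dx, sin_a, cos_a, dalpha)


# Two pole positions just on either side of the wrap-around at +/- pi.
eps = 1e-3
raw_jump = abs((math.pi - eps) - (-math.pi + eps))  # close to 2 * pi
encoded_gap = math.dist(encode_angle(math.pi - eps), encode_angle(-math.pi + eps))
```

Although the two raw angle readings differ by almost $2\pi$, their encoded representations are nearly identical, which is the continuity property the caption refers to.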