Using reinforcement learning to probe the role of feedback in skill acquisition

Antonio Terpin, Raffaello D'Andrea

TL;DR

The paper investigates how external flow feedback shapes the learning of open-loop skills in a chaotic fluid environment by interfacing a generalist reinforcement learning agent with a tabletop water channel around a spinning cylinder. It demonstrates that dense flow feedback enables rapid learning of high-performance drag-control policies, while the same system can struggle to learn drag-maximizing strategies without feedback, suggesting learning demands richer information than execution. The results reveal an asymmetry between drag-minimization and drag-maximization, likely due to non-minimum-phase dynamics in vortex shedding, and show that learned strategies can be executed in open-loop with little loss in performance, provided feedback is used primarily during training. The study also discusses the role of privileged information and observation-adaptive agents, proposing a framework where an agent selectively uses rich flow measurements to accelerate learning but relies on internal representations during deployment. Overall, the work positions generalist RL as a scientific instrument to probe skill acquisition in complex physical environments and motivates architectures that adaptively manage feedback for learning and execution.

Abstract

Many high-performance human activities are executed with little or no external feedback: think of a figure skater landing a triple jump, a pitcher throwing a curveball for a strike, or a barista pouring latte art. To study the process of skill acquisition under fully controlled conditions, we bypass human subjects. Instead, we directly interface a generalist reinforcement learning agent with a spinning cylinder in a tabletop circulating water channel to maximize or minimize drag. This setup has several desirable properties. First, it is a physical system, with the rich interactions and complex dynamics that only the physical world has: the flow is highly chaotic and extremely difficult, if not impossible, to model or simulate accurately. Second, the objective -- drag minimization or maximization -- is easy to state and can be captured directly in the reward, yet good strategies are not obvious beforehand. Third, decades-old experimental studies provide recipes for simple, high-performance open-loop policies. Finally, the setup is inexpensive and far easier to reproduce than human studies. In our experiments we find that high-dimensional flow feedback lets the agent discover high-performance drag-control strategies with only minutes of real-world interaction. When we later replay the same action sequences without any feedback, we obtain almost identical performance. This shows that feedback, and in particular flow feedback, is not needed to execute the learned policy. Surprisingly, without flow feedback during training the agent fails to discover any well-performing policy in drag maximization, but still succeeds in drag minimization, albeit more slowly and less reliably. Our studies show that learning a high-performance skill can require richer information than executing it, and learning conditions can be kind or wicked depending solely on the goal, not on dynamics or policy complexity.

Paper Structure

This paper contains 52 sections, 8 figures.

Figures (8)

  • Figure 1: Wake behind the cylindrical body considered in this study.
  • Figure 2: Top: Our tabletop circulating water channel (CWC) with close-ups of the main components and its top and side view. The channel consists of an upper and a lower branch. The test section (1) in the upper branch houses the cylinder (2). Three APISQUEEN U5 brushless freshwater propellers (3) drive a left-to-right flow, which is redirected into the lower branch by guide vanes (4) and then back to the upper branch. A honeycomb structure (5) straightens the flow and suppresses large-scale vortices. Flow restrictions (6) accelerate the stream and stretch remaining small-scale vortices. Below: Illustrative snapshots of the real-time flow estimates from the imaging setup (7); colors indicate vorticity (red/blue: positive/negative).
  • Figure 3: System overview. The RL agent interacts directly with the physical setup, commanding the rotation rate of the cylinder and receiving as observations the previously commanded rotation rate, the drag on the cylinder, the motor rotation-rate feedback, and possibly the flow estimate from the camera images of neutrally buoyant tracer particles. We compare learning performance under different observation sets.
  • Figure 4: Running-average drag variation ($\%$ with respect to no control) in the episode over training time (min) for DreamerV3 with and without flow feedback for the drag-maximization (left) and drag-minimization (right) tasks. The solid lines show the mean performance of the model across repetitions, whereas the shaded areas show min-max variability.
  • Figure 5: Running-average drag variation ($\%$ with respect to no control) for the drag-maximization (left) and drag-minimization (right) tasks. Shown are the no-control mean (black dashed), best open-loop mean (gray dotted), RL (blue), and replayed actuation (red, mean over five repeats; shading shows min–max variability). Top: We show the curve over training time (minutes). Bottom: We show the curve for a single rollout, over episode time (seconds).
  • ...and 3 more figures
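
The captions of Figures 4 and 5 report a running-average drag variation in percent with respect to no control. The sketch below shows one way such a quantity could be computed. It assumes that the running average is a cumulative mean over the samples of an episode and that the baseline is the mean drag measured without control; neither detail is stated in the captions. The function name and the sample values are hypothetical, not taken from the paper.

```python
from itertools import accumulate


def running_average_drag_variation(drag, baseline_drag):
    """Cumulative-mean drag variation, in percent of a no-control baseline.

    Assumption: "running average" means the mean of all samples seen so far
    in the episode; the paper may use a different window.
    """
    if baseline_drag == 0:
        raise ValueError("baseline_drag must be nonzero")
    # accumulate() yields the running sum; dividing by the sample count
    # gives the running mean at each step.
    return [
        100.0 * (total / (i + 1) - baseline_drag) / baseline_drag
        for i, total in enumerate(accumulate(drag))
    ]


# Hypothetical drag samples (arbitrary units), not measurements from the paper.
baseline = 2.0
samples = [2.0, 1.0, 1.5]
variation = running_average_drag_variation(samples, baseline)
```

With this sign convention, negative values indicate drag below the no-control level (the aim in drag minimization) and positive values indicate drag above it (the aim in drag maximization).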