Using reinforcement learning to probe the role of feedback in skill acquisition
Antonio Terpin, Raffaello D'Andrea
TL;DR
The paper investigates how external flow feedback shapes the learning of skills that are ultimately executed open loop in a chaotic fluid environment. To do so, it interfaces a generalist reinforcement learning agent with a spinning cylinder in a tabletop circulating water channel, with the goal of minimizing or maximizing drag. Dense flow feedback enables the agent to learn high-performance drag-control policies within minutes of real-world interaction. Without flow feedback during training, the agent fails to discover a well-performing drag-maximizing policy, yet it still learns drag minimization, albeit more slowly and less reliably; this suggests that learning can demand richer information than execution. The asymmetry between drag minimization and drag maximization is likely due to non-minimum-phase dynamics in vortex shedding. Once learned with feedback, the strategies can be replayed in open loop with little loss in performance. The study also discusses the role of privileged information and observation-adaptive agents, proposing a framework in which an agent selectively uses rich flow measurements to accelerate learning but relies on internal representations during deployment. Overall, the work positions generalist RL as a scientific instrument for probing skill acquisition in complex physical environments and motivates architectures that adaptively manage feedback for learning and execution.
Abstract
Many high-performance human activities are executed with little or no external feedback: think of a figure skater landing a triple jump, a pitcher throwing a curveball for a strike, or a barista pouring latte art. To study the process of skill acquisition under fully controlled conditions, we bypass human subjects. Instead, we directly interface a generalist reinforcement learning agent with a spinning cylinder in a tabletop circulating water channel to maximize or minimize drag. This setup has several desirable properties. First, it is a physical system, with the rich interactions and complex dynamics that only the physical world has: the flow is highly chaotic and extremely difficult, if not impossible, to model or simulate accurately. Second, the objective -- drag minimization or maximization -- is easy to state and can be captured directly in the reward, yet good strategies are not obvious beforehand. Third, decades-old experimental studies provide recipes for simple, high-performance open-loop policies. Finally, the setup is inexpensive and far easier to reproduce than human studies. In our experiments we find that high-dimensional flow feedback lets the agent discover high-performance drag-control strategies with only minutes of real-world interaction. When we later replay the same action sequences without any feedback, we obtain almost identical performance. This shows that feedback, and in particular flow feedback, is not needed to execute the learned policy. Surprisingly, without flow feedback during training the agent fails to discover any well-performing policy in drag maximization, but still succeeds in drag minimization, albeit more slowly and less reliably. Our studies show that learning a high-performance skill can require richer information than executing it, and learning conditions can be kind or wicked depending solely on the goal, not on dynamics or policy complexity.
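The record-and-replay comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `ToyPlant` is a hypothetical noisy scalar system standing in for the water channel, `feedback_policy` is a hand-written rule rather than a learned agent, and the reward is a simple proxy rather than measured drag. The sketch only shows the protocol (run with feedback, log the actions, replay them without observations); it says nothing about the fluid dynamics or the authors' agent.

```python
import random


class ToyPlant:
    """Hypothetical stand-in for the physical setup: a noisy scalar system."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = 0.0

    def observe(self):
        return self.state

    def step(self, action):
        # First-order response to the action plus a small random disturbance.
        noise = self.rng.gauss(0.0, 0.001)
        self.state = 0.9 * self.state + 0.1 * action + noise
        # Proxy reward (assumption): negative distance to a target state of 1.0.
        return -abs(1.0 - self.state)


def feedback_policy(observation):
    # Hypothetical closed-loop rule: steer the state toward the target of 1.0.
    return 1.0 + 2.0 * (1.0 - observation)


def rollout_closed_loop(plant, policy, steps):
    """Run the policy with feedback; return the logged actions and the return."""
    actions, total = [], 0.0
    for _ in range(steps):
        action = policy(plant.observe())
        actions.append(action)
        total += plant.step(action)
    return actions, total


def replay_open_loop(plant, actions):
    """Replay a logged action sequence without reading any observation."""
    return sum(plant.step(action) for action in actions)


if __name__ == "__main__":
    logged_actions, closed_return = rollout_closed_loop(
        ToyPlant(seed=0), feedback_policy, steps=200
    )
    # A fresh plant with a different seed sees different disturbances.
    open_return = replay_open_loop(ToyPlant(seed=1), logged_actions)
    print(f"closed-loop return: {closed_return:.3f}")
    print(f"open-loop replay return: {open_return:.3f}")
```

In this toy system the replay works because the plant is stable and the disturbances are small; whether a replayed sequence performs comparably on a real flow is exactly the empirical question the paper addresses.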
