Sim-Anchored Learning for On-the-Fly Adaptation

Bassel El Mabsout; Shahin Roozkhosh; Siddharth Mysore; Kate Saenko; Renato Mancuso

TL;DR

This work tackles the sim-to-real transfer problem by introducing anchor critics that preserve the simulation-designed priority profile during real-world adaptation. It frames live adaptation as a multi-objective optimization between a source-domain anchor critic $Q_\Psi$ and a target-domain critic $Q_\pi$, combining them via a geometric-mean conjunction $J_{\pi_\theta}^{\Psi}$ within Fulfillment Priority Logic. In sim-to-sim and real-robot experiments, particularly with a quadrotor using the SwaNNFlight stack, the approach retains intended behaviors while reducing power consumption by up to 50% without control loss. Open-source firmware and tooling are provided to enable on-the-fly adaptation on similar platforms. The key contributions are the anchor-critic formulation, experimental validation across simulation and real hardware, and the SwaNNFlight/SwaNNLake infrastructure for live policy updates. The results indicate that anchoring adaptation to simulation intent can mitigate catastrophic forgetting and deliver practical improvements in safety, efficiency, and robustness for real-time robotic control.

Abstract

Fine-tuning simulation-trained RL agents with real-world data often degrades crucial behaviors due to limited or skewed data distributions. We argue that designer priorities exist not just in reward functions, but also in simulation design choices like task selection and state initialization. When adapting to real-world data, agents can experience catastrophic forgetting in important but underrepresented scenarios. We propose framing live adaptation as a multi-objective optimization problem, where policy objectives must be satisfied both in simulation and reality. Our approach leverages critics from simulation as "anchors for design intent" (anchor critics). By jointly optimizing policies against both anchor critics and critics trained on real-world experience, our method enables adaptation while preserving prioritized behaviors from simulation. Evaluations demonstrate robust behavior retention in sim-to-sim benchmarks and a sim-to-real scenario with a racing quadrotor, allowing for power consumption reductions of up to 50% without control loss. We also contribute SwaNNFlight, an open-source firmware for enabling live adaptation on similar robotic platforms.
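To make the joint objective concrete, the following is a minimal sketch of how an anchor critic value and a target critic value could be combined with a geometric mean, as the TL;DR describes. It is an illustration under stated assumptions, not the authors' implementation: the function names, the min-max normalization, and the value bounds are all hypothetical, and the details of Fulfillment Priority Logic are not reproduced here.

```python
import math


def normalize(q, q_min, q_max):
    """Map a raw critic value into [0, 1].

    Assumption: the bounds q_min and q_max are known; how critic values
    are actually scaled is not stated in the text above.
    """
    if q_max <= q_min:
        raise ValueError("q_max must exceed q_min")
    return min(1.0, max(0.0, (q - q_min) / (q_max - q_min)))


def anchored_objective(q_anchor, q_target, bounds_anchor, bounds_target):
    """Geometric mean of the normalized anchor and target critic values.

    The result is high only when BOTH critics are satisfied, and it is
    zero as soon as either normalized value is zero.
    """
    a = normalize(q_anchor, *bounds_anchor)
    t = normalize(q_target, *bounds_target)
    return math.sqrt(a * t)


if __name__ == "__main__":
    bounds = (0.0, 10.0)  # hypothetical value range shared by both critics
    balanced = anchored_objective(5.0, 5.0, bounds, bounds)
    skewed = anchored_objective(9.0, 1.0, bounds, bounds)
    forgotten = anchored_objective(0.0, 10.0, bounds, bounds)
    print(balanced, skewed, forgotten)
```

In this sketch, a policy that scores 9 on the target critic and 1 on the anchor critic receives a lower combined value than one that scores 5 on both, even though their arithmetic means are equal. That is the property that discourages trading away source-domain behavior for target-domain gains.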

Paper Structure (21 sections, 10 equations, 11 figures, 7 tables)

Introduction
Related Work
Live Adaptation with Anchor Critics
Data Distribution Impact on Policy Optimization
Illustrating Reward Skew
Anchor Critics
Anchor Critics in Simulation
Testing Anchors on Inverted Pendulum
Catastrophic Forgetting in Reacher
Preventing Catastrophic Forgetting on Gymnasium
Anchor Critics in Reality
Engineering SwaNN Lake for Live NN updates
Training Flight Controllers with SwaNNL
SwaNNFlight System Transceiving Metrics
Catastrophic Forgetting and Safety Concerns in Adaptation without Anchors
...and 6 more sections

Figures (11)

Figure 1: Evolution of the angle ($\sin(\theta)$) of an inverted pendulum over time for policies trained with anchor critics, tested on DDPG, SAC and TD3. There are 5 different test runs per goal, and the angle of the pendulum is initialized at random. All 3 anchored algorithms consistently learn to point to 0 rad, thus maintaining a good compromise between our initial goal and our new goal.
Figure 2: Policy $\pi_{\text{S}}$ in (a) is trained in the source domain, where the goal is to hold the pendulum stably leaning to the left. We then fine-tune $\pi_{\text{S}}$ on the target domain, where the goal is to lean the pendulum to the right, producing $\pi_{\text{S}\triangleright\text{T}}$ in (b). We then fine-tune $\pi_{\text{S}}$ again on the target domain, but this time anchored on the source domain. This policy, termed $\pi_{\text{S} \overset{{\psi}}{\triangleright}\text{T}}$, is shown in (c). $\pi_{\text{S} \overset{{\psi}}{\triangleright}\text{T}}$ holds the pendulum straight up with no significant lean, indicating that anchors find a compromise between the source and target domains.
Figure 3: The sine of the inverted pendulum's angle $\alpha$ over time, starting from $\alpha = 90^\circ$, for a DDPG agent trained initially on Domain A as per Fig. \ref{fig:left_pendulum}, and then adapted to Domain B with Domain A anchors as per Fig. \ref{fig:up_pendulum}. Note that initial training allows the agent to stably balance the pendulum at approximately $10^\circ$, while adaptation results in the agent driving the pendulum to approximately $0^\circ$.
Figure 4: Evolution of Reacher agents on the source domain over time (arrows showing directionality). We contrast a naively fine-tuned agent with one that has been anchored. The plots show how unstable the naively tuned agent is, while the anchored agent is able to maintain stability.
Figure 5: Distribution of rewards for agents evaluated on source (S) and target (T) domains. Each black dot indicates an agent's performance under a different random seed. Colored arrows indicate each agent's performance before ($\pi_{\text{S}}$) and after fine-tuning naïvely ($\pi_{\text{S}\triangleright\text{T}}$) or with anchors ($\pi_{\text{S} \overset{{\psi}}{\triangleright}\text{T}}$), with green indicating an increase in performance and red indicating a decrease.
...and 6 more figures
