Swim2Real: VLM-Guided System Identification for Sim-to-Real Transfer

Kevin Qiu, Kyle Walker, Mike Y. Michelis, Marek Cygan, Josie Hughes

Abstract

We present Swim2Real, a pipeline that calibrates a 16-parameter robotic fish simulator from swimming videos using vision-language model (VLM) feedback, requiring no hand-designed search stages. Calibrating soft aquatic robots is particularly challenging because nonlinear fluid-structure coupling makes the parameter landscape chaotic, simplified fluid models introduce a persistent sim-to-real gap, and controlled aquatic experiments are difficult to reproduce. Prior work on this platform required three manually tailored stages to handle this complexity. In Swim2Real, the VLM compares simulated and real videos and proposes parameter updates. A backtracking line search then validates each step size, tripling the accept rate from 14% to 42% by recovering proposals where the direction is correct but the magnitude is too large. Swim2Real calibrates all 16 parameters simultaneously, most closely matching real fish velocities across all motor frequencies (MAE = 7.4 mm/s, 43% lower than the next-best method), with zero outlier seeds across five runs. Downstream RL policies trained in the calibrated simulator swim 12% farther than those from BayesOpt-calibrated simulators and 90% farther than those from CMA-ES-calibrated simulators. Motor commands from the trained policy transfer to the physical fish at 50 Hz, completing the pipeline from swimming video to real-world deployment. These results demonstrate that VLM-guided calibration can close the sim-to-real gap for aquatic robots directly from video, enabling zero-shot RL transfer to physical swimmers without manual system identification, a step toward automated, general-purpose simulator tuning for underwater robotics.

Paper Structure (23 sections, 3 equations, 7 figures, 2 tables, 1 algorithm)


Figures (7)

  • Figure 1: Swim2Real calibrates a robotic fish simulator from video and deploys the resulting RL policy on hardware, with no hand-designed search stages. Stage 1: a VLM compares simulated and real swimming videos, proposes parameter adjustments, and a backtracking line search validates the step size, iterating for up to 40 evaluations. Stage 2: the calibrated simulator trains an RL policy that swims 12% farther than BayesOpt-calibrated policies. Motor commands from the trained policy are deployed on the physical fish at 50 Hz.
  • Figure 2: One iteration of the backtracking line search. The VLM diagnoses a physical discrepancy and proposes updated parameters $\theta'$. The full step ($\beta^0\!=\!1.0$) overshoots, but halving the step size ($\beta^1\!=\!0.5$) yields an improvement over $\mathcal{L}_\mathrm{best}$ and is accepted. This triples the accept rate from 14% to 42% compared to evaluating only the full step.
  • Figure 3: (a) CAD cross-section showing the antagonistic tendon arrangement that crosses at the tail midpoint to produce an S-bend. (b) Block diagram of the onboard electronics. (c) The tendon-driven fish robot platform with annotated components. A single motor drives the full range of swimming gaits used for calibration and RL deployment.
  • Figure 4: Chronophotography of swimming at 1.5 Hz (3 snapshots, increasing opacity). Top: real fish (overhead). Middle: Swim2Real-calibrated simulator (51 mm error). Bottom: CMA-ES calibration (123 mm error). Swim2Real reproduces the body shape and forward progression of the real fish, while CMA-ES exhibits incorrect posture and reduced thrust.
  • Figure 5: Best-so-far L2 error (mean across 5 seeds, with dots showing individual seed final values). (a) All five Swim2Real seeds fall within 50.2–53.2 mm, while CMA-ES collapses on 2 of 5. (b) The line search triples the accept rate (42% vs. 14%), so Swim2Real reaches ${\sim}51$ mm in ${\sim}16$ VLM calls while the no-line-search ablation requires 39.
  • ...and 2 more figures
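
The backtracking step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a shrink factor of $\beta = 0.5$ (matching $\beta^0 = 1.0$ and $\beta^1 = 0.5$ in the caption) and accepts the first scaled step that improves on $\mathcal{L}_\mathrm{best}$. The names `backtracking_step`, `loss_fn`, and `max_halvings` are hypothetical, and a toy quadratic loss stands in for the simulator rollout and video-based L2 error.

```python
from typing import Callable, List, Sequence, Tuple


def backtracking_step(
    theta: Sequence[float],
    theta_proposed: Sequence[float],
    loss_fn: Callable[[List[float]], float],
    loss_best: float,
    beta: float = 0.5,       # assumed shrink factor (beta^0 = 1.0, beta^1 = 0.5, ...)
    max_halvings: int = 3,   # hypothetical cap on the number of shrinks per proposal
) -> Tuple[List[float], float, bool]:
    """Try the full proposed step, then shrink it until the loss improves.

    Returns (parameters, loss, accepted). If no scaled step beats
    loss_best, the original parameters are returned unchanged.
    """
    # Direction of the proposed update (theta' - theta).
    direction = [p - t for t, p in zip(theta, theta_proposed)]
    for k in range(max_halvings + 1):
        scale = beta ** k
        candidate = [t + scale * d for t, d in zip(theta, direction)]
        loss = loss_fn(candidate)  # stand-in for one simulator evaluation
        if loss < loss_best:
            return candidate, loss, True
    return list(theta), loss_best, False


def toy_loss(params: List[float]) -> float:
    """Toy quadratic with its minimum at 1.0; NOT the paper's objective."""
    return sum((p - 1.0) ** 2 for p in params)


theta0 = [0.0]
best0 = toy_loss(theta0)

# Right direction, too large a magnitude: the full step overshoots,
# the halved step improves on the best loss and is accepted.
new_theta, new_loss, accepted = backtracking_step(theta0, [3.0], toy_loss, best0)

# Wrong direction: every scaled step is worse, so the proposal is rejected.
rej_theta, rej_loss, rej_accepted = backtracking_step(theta0, [-4.0], toy_loss, best0)
```

In the first call the full step lands at 3.0 (worse than the starting point) and the halved step at 1.5 is accepted, which is the "correct direction, magnitude too large" case the captions describe. In the second call no scaled step helps, so the parameters are left unchanged.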