Vid2Sid: Videos Can Help Close the Sim2Real Gap

Kevin Qiu; Yu Zhang; Marek Cygan; Josie Hughes

TL;DR

Vid2Sid is a video-driven system identification pipeline that couples foundation-model perception with a VLM-in-the-loop optimizer. The optimizer analyzes paired sim-real videos, diagnoses concrete mismatches, and proposes physics parameter updates with natural language rationales.

Abstract

Calibrating a robot simulator's physics parameters (friction, damping, material stiffness) to match real hardware is often done by hand or with black-box optimizers that reduce error but cannot explain which physical discrepancies drive the error. When sensing is limited to external cameras, the problem is further compounded by perception noise and the absence of direct force or state measurements. We present Vid2Sid, a video-driven system identification pipeline that couples foundation-model perception with a VLM-in-the-loop optimizer that analyzes paired sim-real videos, diagnoses concrete mismatches, and proposes physics parameter updates with natural language rationales. We evaluate our approach on a tendon-actuated finger (rigid-body dynamics in MuJoCo) and a deformable continuum tentacle (soft-body dynamics in PyElastica). On sim2real holdout controls unseen during training, Vid2Sid achieves the best average rank across all settings, matching or exceeding black-box optimizers while uniquely providing interpretable reasoning at each iteration. Sim2sim validation confirms that Vid2Sid recovers ground-truth parameters most accurately (mean relative error under 13% vs. 28–98%), and ablation analysis reveals three calibration regimes. VLM-guided optimization excels when perception is clean and the simulator is expressive, while model-class limitations bound performance in more challenging settings.
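
The sim2sim result above is reported as a mean relative error over recovered parameters. The sketch below shows one common way to compute such a metric; the exact definition used in the paper is not given in this excerpt, so the formula, the function name `mean_relative_error`, and the parameter values are assumptions for illustration only.

```python
import numpy as np


def mean_relative_error(estimated, ground_truth):
    """Average of |estimate - truth| / |truth| over all parameters.

    Assumed definition for illustration; the paper may normalize differently.
    """
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    if est.shape != gt.shape:
        raise ValueError("estimated and ground_truth must have the same shape")
    if np.any(gt == 0.0):
        raise ValueError("relative error is undefined for zero ground-truth values")
    return float(np.mean(np.abs(est - gt) / np.abs(gt)))


# Hypothetical values for three physics parameters (e.g. friction, damping, stiffness).
true_params = [0.5, 2.0, 100.0]
recovered_params = [0.55, 1.8, 110.0]
mre = mean_relative_error(recovered_params, true_params)
```

Because each term is normalized by its own ground-truth value, parameters with very different magnitudes contribute equally to the average.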

Paper Structure (85 sections, 3 equations, 11 figures, 22 tables, 1 algorithm)


Introduction
Related Work
sim2real Transfer and System Identification
Foundation Models for Robot Perception
VLMs for Physical Reasoning in Robotics
Vid2Sid
Simulation Framework
MuJoCo (Rigid-Body Dynamics)
PyElastica (Continuum Dynamics)
Control Signal Generation
Real-World Capture
Trajectory Extraction from Video
Tentacle (SAM3 Segmentation + Centerline Extraction)
Finger (Tip Extraction)
Error Metrics
...and 70 more sections

Figures (11)

Figure 1: Overview of Vid2Sid. Given paired sim-real videos, the perception layer extracts trajectories (SAM3 centerlines for soft robots, marker tracking for rigid robots), and the reasoning layer uses a VLM to diagnose discrepancies and propose physics parameter updates with natural language rationales. This closed-loop process matches or exceeds black-box optimizers within 10 iterations while requiring no task-specific training or optimizer hyperparameters.
Figure 2: Experimental platforms. (a) CAD model of tendon-driven finger. (b) Physical finger with tracked marker. (c) CAD model of soft tentacle. (d) Underwater tentacle setup. These platforms span rigid-body and continuum dynamics, enabling evaluation across qualitatively different calibration regimes.
Figure 3: Centerline extraction pipeline. (a) Simulated centerline (10 equidistant points, base to tip). (b) Real video frame with SAM3 segmentation mask and extracted centerline overlay. The shared 10-point representation enables direct point-wise error computation between sim and real without manual annotation.
Figure 4: Finger qualitative alignment after Vid2Sid calibration. (a) Simulated finger at $t{=}6$ s. (b) Corresponding real video frame with tracked marker. (c) Tip vertical position over time: calibrated simulation (blue) closely tracks real finger motion (orange). The narrow residual (shaded) confirms that Vid2Sid captures both the amplitude and timing of real finger dynamics.
Figure 5: Ablation study on (a) finger and (b) tentacle. Each bar removes one component. Dots show individual seeds. Removing video increases error by 66% on the finger but decreases it by 38% on the tentacle, indicating that prompt design is domain-dependent.
...and 6 more figures
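
The shared 10-point centerline representation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes a centerline is available as an ordered 2D polyline from base to tip; the helper names `resample_centerline` and `pointwise_error`, the arc-length interpolation, and the sample coordinates are assumptions and not the authors' implementation.

```python
import numpy as np


def resample_centerline(points, n_points=10):
    """Resample an ordered 2D polyline to n_points equidistant in arc length."""
    pts = np.asarray(points, dtype=float)
    # Cumulative arc length along the polyline, starting at 0 for the base.
    seg_lengths = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_lengths)])
    if arc[-1] == 0.0:
        raise ValueError("centerline has zero length")
    # Target arc-length positions, evenly spaced from base to tip.
    targets = np.linspace(0.0, arc[-1], n_points)
    x = np.interp(targets, arc, pts[:, 0])
    y = np.interp(targets, arc, pts[:, 1])
    return np.stack([x, y], axis=1)


def pointwise_error(sim_points, real_points):
    """Euclidean distance between corresponding sim and real points."""
    diff = np.asarray(sim_points, dtype=float) - np.asarray(real_points, dtype=float)
    return np.linalg.norm(diff, axis=1)


# Hypothetical centerlines: an irregularly sampled straight line and a shifted copy.
sim_raw = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (9.0, 0.0)]
real_raw = [(0.0, 1.0), (3.0, 1.0), (9.0, 1.0)]
sim_pts = resample_centerline(sim_raw)
real_pts = resample_centerline(real_raw)
errors = pointwise_error(sim_pts, real_pts)
```

Resampling both curves to the same number of arc-length-spaced points gives each point a counterpart at the same relative position along the body, which is what makes a direct per-point comparison possible.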
