TwinTrack: Bridging Vision and Contact Physics for Real-Time Tracking of Unknown Dynamic Objects

Wen Yang, Zhixian Xie, Xuechao Zhang, Heni Ben Amor, Shan Lin, Wanxin Jin

TL;DR

TwinTrack addresses real-time 6-DoF tracking of unknown dynamic objects in contact-rich scenes by bridging vision with contact physics. It introduces Real2Sim to learn geometry and contact dynamics from RGB-D data and Sim2Real to perform physics-aware tracking via adaptive fusion of visual cues and learned dynamics; the system is GPU-accelerated and uses a neural SDF-augmented collision model. A key contribution is a collision-geometry compensation term (delta), learned alongside mass, inertia, and friction through a sampling-based optimization (CEM) that handles non-smooth contact events. The framework achieves robust, real-time tracking (>20 Hz) in falling and in-hand manipulation scenarios, mitigating the effects of occlusion and motion blur and improving alignment between perception and physical reality for downstream control tasks.
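To illustrate why a sampling-based optimizer suits non-smooth contact events, here is a minimal sketch of the cross-entropy method (CEM) on a toy problem. Everything in it is an assumption made for illustration: the 1-D ball-drop simulator, the two parameters (`restitution` and a scalar `radius_offset` standing in for a geometry compensation), and all function names are hypothetical and are not taken from TwinTrack, which identifies mass, inertia, friction, and delta inside its own physics engine.

```python
import random
import statistics


def simulate_drop(restitution, radius_offset, steps=200, dt=0.01):
    """Toy 1-D ball drop with a hard ground contact (non-smooth in the parameters)."""
    radius = 0.05 + radius_offset      # nominal radius plus a scalar geometry offset
    height, velocity = 1.0, 0.0        # centre height [m], vertical velocity [m/s]
    trajectory = []
    for _ in range(steps):
        velocity -= 9.81 * dt
        height += velocity * dt
        if height < radius and velocity < 0.0:   # impact: clamp, reflect, and damp
            height = radius
            velocity = -restitution * velocity
        trajectory.append(height)
    return trajectory


def trajectory_loss(params, observed):
    """Mean squared error between simulated and observed centre heights."""
    restitution, radius_offset = params
    predicted = simulate_drop(restitution, radius_offset)
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def cem(loss, mean, std, iterations=30, population=64, elite_count=8, seed=0):
    """Cross-entropy method: sample, keep the elites, refit a diagonal Gaussian."""
    rng = random.Random(seed)
    mean, std = list(mean), list(std)
    for _ in range(iterations):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        samples.sort(key=loss)                    # lowest loss first
        elites = samples[:elite_count]
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]
    return mean


# Synthetic "observation" generated with known parameters (stand-in for tracked poses).
observed = simulate_drop(restitution=0.6, radius_offset=0.01)
estimate = cem(lambda p: trajectory_loss(p, observed), mean=[0.5, 0.0], std=[0.3, 0.02])
print("estimated restitution, radius offset:", estimate)
```

Because CEM only ranks samples by their loss, it never differentiates through the impact branch in `simulate_drop`, which is the property that makes this family of optimizers attractive when contact makes the objective non-smooth.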

Abstract

Real-time tracking of previously unseen, highly dynamic objects in contact-rich environments, such as during dexterous in-hand manipulation, remains a significant challenge. Purely vision-based tracking often suffers from heavy occlusions due to frequent contact interactions and from motion blur caused by abrupt motion during contact impacts. We propose TwinTrack, a physics-aware visual tracking framework that enables robust, real-time 6-DoF pose tracking of unknown dynamic objects in a contact-rich scene by leveraging the contact physics of the observed scene. At the core of TwinTrack is an integration of Real2Sim and Sim2Real. In Real2Sim, we combine the complementary strengths of vision and contact physics to estimate the object's collision geometry and physical properties: the object's geometry is first reconstructed from vision, then updated along with other physical parameters from contact dynamics for physical accuracy. In Sim2Real, robust pose estimation of the object is achieved by adaptive fusion of visual tracking and predictions from the learned contact physics. TwinTrack is built on a GPU-accelerated, deeply customized physics engine to ensure real-time performance. We evaluate our method on two contact-rich scenarios: object falling with rich contact impacts against the environment, and contact-rich in-hand manipulation. Experimental results demonstrate that, compared to baseline methods, TwinTrack achieves significantly more robust, accurate, and real-time 6-DoF tracking in these challenging scenarios, with tracking speed exceeding 20 Hz. Project page: https://irislab.tech/TwinTrack-webpage/
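The adaptive fusion described in the abstract can be pictured as a weighted blend of two pose estimates. The sketch below is only one plausible form, under stated assumptions: the weight is a hypothetical scalar `visual_confidence` in [0, 1], positions are blended linearly, and orientations are blended with quaternion slerp. The abstract does not specify how TwinTrack computes its weight or combines the poses, so none of these names or choices should be read as the authors' method.

```python
import math


def slerp(q0, q1, t):
    """Spherical interpolation between unit quaternions given as (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                       # flip one quaternion to take the shorter arc
        q1, dot = [-c for c in q1], -dot
    if dot > 0.9995:                    # nearly parallel: normalised lerp is stable
        q = [a + t * (b - a) for a, b in zip(q0, q1)]
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        q = [s0 * a + s1 * b for a, b in zip(q0, q1)]
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


def fuse_pose(visual_pose, physics_pose, visual_confidence):
    """Blend two (position, quaternion) estimates; a weight of 1.0 trusts vision fully."""
    w = min(1.0, max(0.0, visual_confidence))    # hypothetical weight, clamped to [0, 1]
    p_vis, q_vis = visual_pose
    p_phy, q_phy = physics_pose
    position = [w * a + (1.0 - w) * b for a, b in zip(p_vis, p_phy)]
    orientation = slerp(q_phy, q_vis, w)         # w = 0 returns the physics orientation
    return position, orientation


# Example: vision is half-trusted, e.g. under partial occlusion (illustrative values).
identity = [1.0, 0.0, 0.0, 0.0]
quarter_turn_z = [math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)]
fused = fuse_pose(([1.0, 0.0, 0.0], quarter_turn_z), ([0.0, 0.0, 0.0], identity), 0.5)
print(fused)
```

With this form, driving the weight toward 0 during occlusion or motion blur lets the dynamics prediction carry the track, and driving it toward 1 lets clean visual measurements correct any drift in the simulation.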

Paper Structure

This paper contains 25 sections, 11 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of the TwinTrack framework. Our framework includes two main components: Real2Sim for learning object geometry and contact physics, and Sim2Real for physics-aware real-time pose tracking. In Real2Sim, the object's visual geometry, represented as a Gaussian Splatting model, is obtained from a selection of keyframes by jointly optimizing the geometry and the keyframes. The obtained geometry is continually updated in the next phase of learning contact dynamics, together with the identification of other physical properties. In Sim2Real, feature correspondence is performed for each new frame with respect to an optimized keyframe from Real2Sim; meanwhile, the learned contact dynamics also predict the current object pose. The final object pose is an adaptive fusion of visual tracking and dynamics prediction.
  • Figure 2: Visual geometry estimation. Left: joint optimization of Gaussian Splatting (GS) model $\mathcal{G}_{\text{gs}}$ and keyframe poses. Middle: depth rendering from the obtained GS model. Right: learning neural SDF $\texttt{SDF}_{\text{vision}}$ from the rendered depth images.
  • Figure 3: Collision geometry is the visual geometry plus a learnable geometry compensation.
  • Figure 4: Adaptive weighting parameter
  • Figure 5: Two contact-rich scenarios with different objects.
  • ...and 2 more figures
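
The Figure 3 caption describes collision geometry as visual geometry plus a learnable compensation. A minimal sketch of that idea follows, assuming the compensation is a single scalar `delta` added to the signed distance; the listing above does not say whether the paper's compensation is scalar or spatially varying. The sphere used for `sdf_vision` is a stand-in for the neural SDF learned from rendered depth, and all function names are hypothetical.

```python
import math


def sdf_vision(point, radius=0.05):
    """Stand-in for the vision-derived SDF: a sphere of the given radius at the origin."""
    return math.sqrt(sum(c * c for c in point)) - radius


def sdf_collision(point, delta):
    """Collision SDF as the visual SDF plus a scalar compensation (an assumption).

    With this sign convention a negative delta grows the collision surface
    outward by |delta| metres and a positive delta shrinks it.
    """
    return sdf_vision(point) + delta


def penetration_depth(center_height, delta):
    """Depth by which the ground plane z = 0 penetrates the collision surface."""
    nearest_ground_point = (0.0, 0.0, -center_height)   # expressed in the object frame
    return max(0.0, -sdf_collision(nearest_ground_point, delta))


# The visual surface alone reports no contact at this height; a compensation
# that grows the surface by 1 cm makes the same configuration a contact.
print(penetration_depth(0.055, delta=0.0))
print(penetration_depth(0.055, delta=-0.01))
```

A compensation of this kind matters because contact timing depends on where the collision surface sits: a visual reconstruction that is a few millimetres off predicts impacts too early or too late, and fitting `delta` to observed contact dynamics can absorb that error.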