OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation

Yuhang Zheng, Songen Gu, Weize Li, Yupeng Zheng, Yujie Zang, Shuai Tian, Xiang Li, Ruihai Wu, Ce Hao, Chen Gao, Si Liu, Haoran Li, Yilun Chen, Shuicheng Yan, Wenchao Ding

Abstract

Contact-rich manipulation tasks, such as wiping and assembly, require accurate perception of contact forces, friction changes, and state transitions that cannot be reliably inferred from vision alone. Despite growing interest in visuo-tactile manipulation, progress is constrained by two persistent limitations: existing datasets are small in scale and narrow in task coverage, and current methods treat tactile signals as passive observations rather than explicitly using them to model contact dynamics or enable closed-loop control. In this paper, we present OmniViTac, a large-scale visuo-tactile-action dataset comprising 21,000+ trajectories across 86 tasks and 100+ objects, organized into six physics-grounded interaction patterns. Building on this dataset, we propose OmniVTA, a world-model-based visuo-tactile manipulation framework that integrates four tightly coupled modules: a self-supervised tactile encoder, a two-stream visuo-tactile world model for predicting short-horizon contact evolution, a contact-aware fusion policy for action generation, and a 60 Hz reflexive controller that corrects deviations between predicted and observed tactile signals in a closed loop. Real-robot experiments across all six interaction categories show that OmniVTA outperforms existing methods and generalizes well to unseen objects and geometric configurations, confirming the value of combining predictive contact modeling with high-frequency tactile feedback for contact-rich manipulation. All data, models, and code will be made publicly available on the project website at https://mrsecant.github.io/OmniVTA.

Paper Structure (49 sections, 14 equations, 16 figures, 9 tables)

Figures (16)

  • Figure 1: Overview of the proposed visuo-tactile manipulation system. (Left) We introduce OmniViTac, a large-scale visuo-tactile-action aligned dataset for contact-rich manipulation. (Center) We propose OmniVTA, a world model-based visuo-tactile-action framework that predicts future contact states. It seamlessly unifies tactile representation learning, predictive multimodal modeling, adaptive fusion, and reflexive tactile control. (Right) Extensive real-world experiments demonstrate that OmniVTA outperforms prior methods, exhibiting strong robustness and generalization.
  • Figure 2: Overview of the OmniViTac dataset. Left: The Cross-Embodiment Data Collection Platform features the UFACTORY 7-DoF xArm and the TacUMI manipulation interface, both supporting identical end-effectors and diverse tactile sensors (Xense, GelSight Mini, Tac3D, Daimon). Middle: The dataset covers 6 visuo-tactile manipulation patterns, instantiated across 5 semantic scenarios. Top-Right: A scale comparison demonstrates that OmniViTac (21,879 trajectories) significantly exceeds existing visuo-tactile manipulation datasets in tactile-rich data volume. Bottom-Right: The High-quality Data Pipeline ensures reliability through timestamp alignment, visualization, and human-in-the-loop verification.
  • Figure 3: Example visualization of the 6 visuo-tactile manipulation patterns in OmniViTac. The dataset captures diverse, contact-rich behaviors across various scenarios, including Assembly, Cutting, Adjustment, Peeling, Grasping, and Wiping. Each panel illustrates the global third-person workspace view overlaid with the end-effector's trajectory, alongside synchronized, high-frequency tactile maps (bottom) that continuously record the complex tool-object contact dynamics during task execution.
  • Figure 4: Comprehensive statistical analysis of the OmniViTac dataset. (a) Pattern-level contact area distribution, highlighting the dichotomy between precision-dominant tasks (concentrated in the 0-10% range) and surface-dominant tasks (peaking at 70-90%). (b) Force intensity distribution, showcasing the wide spectrum of force magnitudes required across different manipulation modes. (c) Hierarchical distribution illustrating the diversity of the 86 instantiated tasks. (d) Effective contact ratio with variance across patterns, demonstrating the temporal dependency on tactile feedback (e.g., Adjustment exhibits the highest continuous engagement). (e) Total trajectory counts per category, demonstrating the scale and balanced composition of the benchmark. (f) t-SNE projection of the high-dimensional tactile signals, revealing physically intuitive and semantically separable latent clusters that align with the underlying contact mechanics of each interaction pattern.
  • Figure 5: System Overview. OmniVTA is a hierarchical slow–fast policy for contact-rich manipulation. The slow policy contains a visuo-tactile world model and an adaptive fusion policy to generate long-horizon action chunks from multi-modal inputs. The fast policy outputs high-frequency refinements at 60 Hz using tactile feedback. The final action is a weighted sum of the slow-planned and fast-refined outputs, enabling both long-horizon planning and reactive control for robust manipulation.
  • ...and 11 more figures
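
The slow–fast combination described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `fast_refinement` and `weighted_sum`, the proportional gain, the weights, and the assumption that the tactile signal is summarized in the same dimension as the action are all hypothetical choices made for illustration. The summary above states only that predicted and observed tactile signals are compared and that the final action is a weighted sum of the two outputs.

```python
from typing import List, Sequence


def fast_refinement(
    slow_action: Sequence[float],
    predicted_tactile: Sequence[float],
    observed_tactile: Sequence[float],
    gain: float,
) -> List[float]:
    """Hypothetical reflex: shift the slow action in proportion to the
    deviation between predicted and observed tactile signals."""
    if not (len(slow_action) == len(predicted_tactile) == len(observed_tactile)):
        raise ValueError("all inputs must have the same dimension")
    return [
        s + gain * (p - o)
        for s, p, o in zip(slow_action, predicted_tactile, observed_tactile)
    ]


def weighted_sum(
    slow_action: Sequence[float],
    fast_action: Sequence[float],
    w_slow: float,
    w_fast: float,
) -> List[float]:
    """Combine slow-planned and fast-refined actions element-wise.
    The weights are assumed values, not taken from the paper."""
    if len(slow_action) != len(fast_action):
        raise ValueError("slow and fast actions must have the same dimension")
    return [w_slow * s + w_fast * f for s, f in zip(slow_action, fast_action)]


# One tick of the (assumed) 60 Hz fast loop, using made-up numbers.
slow_action = [0.10, 0.00, -0.05]   # one step of a slow-policy action chunk
predicted = [1.0, 0.0, 2.0]         # tactile summary predicted by the world model
observed = [1.0, 0.5, 1.0]          # tactile summary actually measured

fast_action = fast_refinement(slow_action, predicted, observed, gain=0.1)
final_action = weighted_sum(slow_action, fast_action, w_slow=0.5, w_fast=0.5)
print(final_action)
```

In this sketch the slow action is held fixed while the fast loop recomputes the refinement on every tactile reading, so only the cheap correction runs at the high rate.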