OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou, Rishi Shiv, Yaqi Li, Haoyu Xiong, Crystal Elaine Owens, Yilun Du, Yiyue Luo, Xianyi Cheng, Antonio Torralba, Wojciech Matusik, Paul Pu Liang

TL;DR

OpenTouch introduces the first in-the-wild, egocentric full-hand tactile dataset capturing synchronized vision, contact forces, and hand pose. The authors provide a low-cost, open tactile glove and a comprehensive collection/annotation pipeline across diverse environments, plus benchmarks for cross-sensory retrieval and tactile-based grasp classification. Results show that tactile signals are compact yet highly informative for grasp understanding and can improve cross-modal alignment when combined with vision and pose, with temporal context and encoder design significantly affecting performance. The dataset and benchmarks enable scalable research in touch-grounded perception and robotic manipulation, bridging vision and tactile sensing in real-world manipulation scenarios.

Abstract

The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

Paper Structure

This paper contains 37 sections, 3 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: OpenTouch is the first in-the-wild, full-hand tactile dataset with synchronized egocentric video, force-aware full-hand touch, and hand-pose trajectories. OpenTouch provides: (a) tactile signals that reveal rich, time-varying contact and support forces which cannot be discerned from vision alone, even under nearly identical grasp poses; (b) hardware-based, robust hand-pose tracking and extensive text annotations; (c) 5 hours of recordings, including 3 hours of densely annotated, contact-rich interactions. See the supplementary video for the sensitivity and fidelity of the tactile dynamics.
  • Figure 2: (a) Sankey diagram visualizing the distribution of dataset labels, including environment, action, grasp type, and object category. See the full list of actions and grasp types in the Supp. Mat. (b) Accumulated tactile maps across the dataset for different grasp types. The spatial pressure patterns strongly correlate with the underlying grasp configuration, demonstrating the accuracy and quality of our tactile data and grasp type annotation. See the Supp. Mat. for complete tactile–grasp mappings covering all grasp taxonomies.
  • Figure 3: Example data from OpenTouch demonstrates that hardware-based tactile sensing and pose tracking reveal critical force, contact, and motion cues that vision alone cannot capture. (a) Although the first three frames show nearly identical hand poses, the tactile signals reveal that in the third frame the hand applies sufficient force to move the chair. (b) In the first frame, tactile readings clearly indicate contact with the table, which is ambiguous from RGB alone. In the next two frames, the hand moves out of view, making vision-based pose estimation unreliable; OpenTouch provides accurate hardware-tracked poses throughout. (c) Tactile sensing exposes clear interaction patterns with a transparent object that remain difficult to infer from visual tracking alone. (d) The tactile map captures a subtle middle-finger double-click on a button, a fine-grained motion that even pose tracking may miss. See the supplementary video for the high-fidelity tactile signals and subtle dynamic patterns.
  • Figure 4: Overview of the data-capture and annotation setup. Meta Aria glasses, Rokoko Smartgloves, and the FPC-based tactile sensor are synchronized at 30 Hz with an average 2 ms latency, enabled by a zero-potential readout circuit and lightweight ESP-NOW wireless transmission. The system captures synchronized egocentric video, hand pose, and dense full-hand touch signals. High-level descriptions and detailed annotations are automatically generated from the egocentric video and the rendered tactile maps using a large language model.
  • Figure 5: Qualitative retrieval results. Each block shows the query, the ground-truth, and the retrieved target using five frames selected from the video sequence for visualization only. Left: Video-to-tactile retrieval. Top result exhibits a highly similar pressure distribution. Right: Tactile-to-video retrieval. Videos depicting similar interactions are retrieved from tactile input. The top block corresponds to grasping and flipping a round object, while the bottom block shows placing an object down.
  • ...and 6 more figures
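
The cross-modal retrieval setting illustrated in Figure 5 (video-to-tactile and tactile-to-video) can be sketched as a nearest-neighbor search over paired embeddings, scored by Recall@K. The following is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `cosine` and `recall_at_k`, the use of cosine similarity, and the two-dimensional toy embeddings are all hypothetical, and the encoders and metrics used in the paper may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(query_embs, target_embs, k):
    """Fraction of queries whose paired target (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(query_embs):
        scores = [cosine(query, target) for target in target_embs]
        # Rank target indices from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Made-up embeddings for illustration only; clip i in `video` is paired
# with clip i in `tactile`.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tactile = [[0.9, 0.1], [0.7, 0.6], [0.2, 0.8]]

print(recall_at_k(video, tactile, k=1))  # video-to-tactile retrieval
print(recall_at_k(tactile, video, k=1))  # tactile-to-video retrieval
```

Swapping the argument order switches the retrieval direction, so the same routine covers both halves of the benchmark as described in the caption.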