AnyTouch 2: General Optical Tactile Representation Learning For Dynamic Tactile Perception

Ruoxuan Feng, Yuxuan Zhou, Siyu Mei, Dongzhan Zhou, Pengwei Wang, Shaowei Cui, Bin Fang, Guocai Yao, Di Hu

TL;DR

This work tackles the insufficiency of dynamic tactile data and perception models by introducing the ToucHD dataset and AnyTouch 2, a general representation learning framework for diverse optical tactile sensors. The authors organize tactile data into a five-tier dynamic pyramid and implement curriculum-guided training with pixel-level, semantic-level, and physical-level objectives, including frame-difference reconstruction, cross-sensor and action matching, and touch–force prediction. Empirical results show strong performance across static and dynamic tactile perception benchmarks and real-world manipulation tasks, with ablations confirming the value of each module and the high-tier ToucHD data. The approach enables sensor-invariant, force-aware, dexterous manipulation and provides a scalable foundation for dynamic tactile intelligence in robotics.
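This summary does not say how the frame-difference reconstruction objective is computed. As a minimal sketch, assume the pixel-level target is simply the difference between consecutive tactile frames, and that a prediction is scored with mean squared error. The function names `frame_difference_targets` and `reconstruction_loss`, the `(T, H, W, C)` clip layout, and the choice of MSE are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def frame_difference_targets(frames: np.ndarray) -> np.ndarray:
    """Return differences between consecutive frames of a tactile clip.

    frames: array of shape (T, H, W, C) (assumed layout).
    Output has shape (T - 1, H, W, C); entry t equals frames[t + 1] - frames[t].
    """
    if frames.ndim != 4 or frames.shape[0] < 2:
        raise ValueError("expected a (T, H, W, C) clip with T >= 2")
    # Cast first so unsigned image dtypes do not wrap around on subtraction.
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]


def reconstruction_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between predicted and true frame differences (assumed loss)."""
    return float(np.mean((pred - target) ** 2))


# Toy clip: three 2x2 single-channel frames, with contact only in the middle frame.
clip = np.zeros((3, 2, 2, 1), dtype=np.uint8)
clip[1] = 1
targets = frame_difference_targets(clip)
```

In this toy clip the first difference is positive (contact appears) and the second is negative (contact releases), which is the kind of temporal change a static per-frame reconstruction target would not expose.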

Abstract

Real-world contact-rich manipulation requires robots to perceive temporal tactile feedback, capture subtle surface deformations, and reason about object properties as well as force dynamics. Although optical tactile sensors are uniquely capable of providing such rich information, existing tactile datasets and models remain limited. These resources primarily focus on object-level attributes (e.g., material) while largely overlooking fine-grained tactile temporal dynamics during physical interactions. We argue that advancing dynamic tactile perception requires a systematic hierarchy of dynamic perception capabilities to guide both data collection and model design. To address the lack of tactile data with rich dynamic information, we present ToucHD, a large-scale hierarchical tactile dataset spanning tactile atomic actions, real-world manipulations, and touch-force paired data. Beyond scale, ToucHD establishes a comprehensive tactile dynamic data ecosystem that explicitly supports hierarchical perception capabilities from the data perspective. Building on it, we propose AnyTouch 2, a general tactile representation learning framework for diverse optical tactile sensors that unifies object-level understanding with fine-grained, force-aware dynamic perception. The framework captures both pixel-level and action-specific deformations across frames while explicitly modeling physical force dynamics, thereby learning multi-level dynamic perception capabilities from the model perspective. We evaluate our model on benchmarks that cover static object properties and dynamic physical attributes, as well as on real-world manipulation tasks spanning multiple tiers of dynamic perception capabilities, from basic object-level understanding to force-aware dexterous manipulation. Experimental results demonstrate consistent and strong performance across sensors and tasks.

Paper Structure (34 sections, 9 equations, 17 figures, 16 tables)


Figures (17)

  • Figure 1: Tactile Dynamic Pyramid and ToucHD dataset. We organize tactile pre-training data into 5 tiers based on data rarity and the complexity of the dynamic perception capabilities they support. The datasets shown in black font are existing ones. Most current datasets fall into the lower tiers (4 and 5), while higher tiers (1, 2, and 3) remain notably scarce. To bridge this gap, we present ToucHD, a large-scale hierarchical dynamic tactile dataset spanning tactile atomic actions, real-world manipulations, and touch–force paired data. ToucHD is designed to enrich high-tier tactile data and establish a complete dynamic tactile data ecosystem, thereby comprehensively supporting dynamic tactile perception.
  • Figure 2: Overview of AnyTouch 2. Our model unifies object-level tactile semantics with fine-grained dynamic and physical perception, learning a general tactile representation that supports a broad spectrum of downstream tasks. By incorporating multi-level dynamic enhanced modules aligned with the tiers of the tactile dynamic pyramid, it strengthens sensitivity to subtle tactile variations and improves reasoning about the physical properties underlying dynamic interactions.
  • Figure 3: Real-world manipulation tasks. We evaluate models on real-world manipulation tasks that span the dynamic capabilities of different tiers in our tactile dynamic pyramid: Tactile Grasping (Tier 5), Whiteboard Wiping (Tiers 4 & 3), USB Insertion (Tier 2), and Chip Moving (Tier 1).
  • Figure 4: Evaluation of real-world manipulation tasks. This evaluation spans DIGIT and GelSight Mini. Each dynamic model that takes consecutive tactile frames as input has a corresponding dynamic tier, which denotes the highest level of the training data and objectives used in our tactile dynamic pyramid shown in Fig. 1, reflecting the model's dynamic perception capability. † denotes additional training data including ToucHD.
  • Figure 5: Simulated data acquisition. 3D object models are processed using an IMPM optical tactile simulation platform, which comprises two components: the IMPM simulator and a Blender-based rendering module. First, the IMPM simulator generates 3D elastomer models that capture deformations caused by object rotations and sliding motions. The Blender-based rendering module then converts these elastomer models into tactile images for different optical sensors.
  • ...and 12 more figures