SARL: Spatially-Aware Self-Supervised Representation Learning for Visuo-Tactile Perception

Gurmeher Khurana, Lan Wei, Dandan Zhang

TL;DR

SARL addresses the need for geometry-aware representations in contact-rich manipulation by leveraging fused visuo-tactile data and augmenting BYOL with three map-level losses that preserve spatial structure. SAL, PPDA, and RAM enforce attentional, semantic, and geometric consistency on intermediate feature maps, yielding richer representations than global invariance alone. Across six downstream tasks, SARL demonstrates substantial gains over nine SSL baselines, particularly on geometry-sensitive tasks, and transfers well to unseen visuo-tactile datasets, underscoring the value of spatial equivariance for manipulation-ready perception.

Abstract

Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring both visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, which discards the spatial structure that manipulation depends on. To address this, we propose SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives: Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM). Together they keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps and complement the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks on fused visuo-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE), and approaches the supervised upper bound. These findings indicate that, for fused visuo-tactile data, the most effective training signal is structured spatial equivariance, in which features vary predictably with object geometry, enabling more capable robotic perception.
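
As a quick check on the headline number, the relative improvement can be recomputed from the two MAE values quoted in the abstract. This assumes "relative improvement" means the reduction in MAE divided by the baseline MAE, which is the usual convention but is not spelled out here:

$$
\frac{\mathrm{MAE}_{\text{next-best}} - \mathrm{MAE}_{\text{SARL}}}{\mathrm{MAE}_{\text{next-best}}}
= \frac{0.5682 - 0.3955}{0.5682}
= \frac{0.1727}{0.5682}
\approx 0.304
$$

Under that reading the reduction is about 30.4%, consistent with the reported 30%.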

Paper Structure

This paper contains 27 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Concept overview: SARL augments a joint-embedding architecture with spatially-aware losses (SAL, PPDA, and RAM) on intermediate feature layers, complementing a global loss on final embeddings to preserve spatial and geometric cues typically removed by global pooling, yielding manipulation-ready representations.
  • Figure 2: SARL architecture. Input $x$ is augmented to $x_1,x_2$ and processed by parallel Online ($\theta$) and Target ($\xi$) branches. The global loss $\mathcal{L}_{\text{global}}$ trains $q_\theta$ to match $g_\xi$; the Target weights are an exponential moving average (EMA) of the Online weights, and a stop-gradient ($\mathrm{sg}$) is applied to the Target branch. From encoder features $f_\theta,f_\xi$, SARL adds spatial losses: $\mathcal{L}_{SAL}$ (Layers 2–4), $\mathcal{L}_{PPDA}$ (Layer 3 with a $7 \times 7$ grid and $K=32$ prototypes), and $\mathcal{L}_{RAM}$ (Layer 3 with a $6 \times 6$ grid).
  • Figure 3: Overview of the datasets used for pre-training, evaluation, and transfer learning. (a) Sample images from the primary ViTacTip dataset, showcasing the fused visuo-tactile data across several downstream tasks. (b) The three data modalities used in our ablation studies: the fused multimodal ViTacTip data (top row), its visual-only ViTac counterpart (middle row), and its marker-only TacTip counterpart (bottom row). (c) Sample images from the four unseen datasets used for the generalizability evaluation: (i) MagicTac-T, (ii) MagicTac-D, (iii) GelSight, and (iv) PnuTac.
  • Figure 4: Linear probe performance on the Edge Pose Regression task, comparing SARL (bottom row) against the fully supervised baseline (top row). Each plot shows the predicted vs. true values for the three pose components: horizontal distance ($X$ in mm), press depth ($Z$ in mm), and angle of rotation ($\theta$ in degrees).
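
The Online/Target mechanics described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It shows the standard BYOL pieces only: a normalized regression loss between the Online prediction and the Target projection, and the EMA update of Target weights. It is not the authors' implementation. The function names, the decay `tau=0.99`, and the 42×42 feature-map size are hypothetical. The average-pooling `grid_pool` helper is likewise an assumption about how a $7 \times 7$ or $6 \times 6$ grid could be formed from a Layer 3 feature map; the SAL, PPDA, and RAM losses themselves are not specified here and are deliberately left out.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale each row of v to unit L2 norm."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def byol_global_loss(online_pred, target_proj):
    """Standard BYOL loss: 2 - 2 * cosine similarity, averaged over the batch.

    In a real training loop target_proj would sit behind a stop-gradient;
    NumPy has no autograd, so it is simply treated as a constant here.
    """
    q = l2_normalize(online_pred)
    z = l2_normalize(target_proj)
    return float(np.mean(2.0 - 2.0 * np.sum(q * z, axis=-1)))

def ema_update(target_params, online_params, tau=0.99):
    """Target <- tau * Target + (1 - tau) * Online (tau is a hypothetical value)."""
    return {k: tau * target_params[k] + (1.0 - tau) * online_params[k]
            for k in target_params}

def grid_pool(feature_map, grid):
    """Average-pool a (C, H, W) map to (C, grid, grid).

    Assumes H and W are divisible by grid. Average pooling is an
    illustrative choice, not a detail confirmed by the source.
    """
    c, h, w = feature_map.shape
    cells = feature_map.reshape(c, grid, h // grid, grid, w // grid)
    return cells.mean(axis=(2, 4))

rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 16))      # Online predictor outputs for view 1
proj = rng.normal(size=(4, 16))      # Target projector outputs for view 2
loss = byol_global_loss(pred, proj)  # always lies in [0, 4]

fmap = rng.normal(size=(8, 42, 42))  # hypothetical Layer 3 feature map
grid7 = grid_pool(fmap, 7)           # keeps a coarse 7x7 spatial layout
grid6 = grid_pool(fmap, 6)           # keeps a coarse 6x6 spatial layout
global_vec = fmap.mean(axis=(1, 2))  # global pooling: one value per channel
```

The contrast between `grid7` and `global_vec` is the point of the sketch: the global vector keeps 8 numbers per image, while the grid keeps 8 × 49, so a map-level loss can still compare where in the image a feature fires across the two views.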