ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image

Dongyu Luo, Kelin Yu, Amir-Hossein Shahidzadeh, Cornelia Fermüller, Yiannis Aloimonos, Ruohan Gao

TL;DR

This work tackles the high cost and limited transferability of vision-based tactile data by introducing ControlTac, a two-stage diffusion-based framework that generates realistic tactile images conditioned on a single reference image, a target force $\Delta F$, and a target contact position. The first stage performs force-controlled generation with a diffusion transformer (DiT). The second stage adds position control through a ControlNet conditioned on a contact mask, so the framework produces physically plausible, diverse tactile images from minimal data. Empirically, augmenting with ControlTac improves force estimation, pose estimation, and object classification, and the resulting models transfer to real-world deployment, including precise object insertion with high success rates. The modular design allows conditioning to be extended to additional tactile priors, which makes the approach practical for scaling tactile datasets and for robotic manipulation.

Abstract

Vision-based tactile sensing has been widely used in perception, reconstruction, and robotic manipulation. However, collecting large-scale tactile data remains costly due to the localized nature of sensor-object interactions and inconsistencies across sensor instances. Existing approaches to scaling tactile data, such as simulation and free-form tactile generation, often suffer from unrealistic output and poor transferability to downstream tasks. To address this, we propose ControlTac, a two-stage controllable framework that generates realistic tactile images conditioned on a single reference tactile image, contact force, and contact position. With those physical priors as control input, ControlTac generates physically plausible and varied tactile images that can be used for effective data augmentation. Through experiments on three downstream tasks, we demonstrate that ControlTac can effectively augment tactile datasets and lead to consistent gains. Our three real-world experiments further validate the practical utility of our approach. Project page: https://dongyuluo.github.io/controltac.
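
The augmentation idea in the abstract (one reference image, many force and position conditions) can be illustrated with a minimal sketch that only builds the list of control inputs. It does not implement ControlTac itself. The names `Condition` and `sample_conditions`, the shear range, the pixel-shift range, and the uniform sampling are all assumptions made for illustration. The 1 to 10 N default for the normal component is borrowed from the range quoted for the force estimation experiment and may not match how the authors sample forces.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """One hypothetical control input paired with the single reference image."""
    delta_force: tuple  # (Fx, Fy, Fz) in newtons; a 3D force as in the paper
    shift_px: tuple     # (dx, dy) contact-position offset in pixels (assumed encoding)


def sample_conditions(n, normal_range=(1.0, 10.0), max_shear=1.0,
                      max_shift=20, seed=0):
    """Draw n (force, position) conditions uniformly at random.

    All ranges are illustrative assumptions, not values from the paper,
    except that normal_range mirrors the 1-10 N range mentioned for
    force estimation.
    """
    rng = random.Random(seed)  # seeded so the plan is reproducible
    conditions = []
    for _ in range(n):
        force = (
            rng.uniform(-max_shear, max_shear),  # shear x
            rng.uniform(-max_shear, max_shear),  # shear y
            rng.uniform(*normal_range),          # normal force
        )
        shift = (
            rng.randint(-max_shift, max_shift),
            rng.randint(-max_shift, max_shift),
        )
        conditions.append(Condition(force, shift))
    return conditions


# Each condition would be passed, together with the reference image,
# to a trained generator (not shown here).
plan = sample_conditions(3)
for cond in plan:
    print(cond)
```

Keeping the sampling separate from the generator makes it easy to control how many augmented images are requested per reference image and to reproduce a given augmentation plan from its seed.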

Paper Structure

This paper contains 34 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Illustrations of ControlTac's utilities: starting from a single reference image, ControlTac can generate tens of thousands of augmented tactile images with various contact forces and contact positions (Left). These augmented images can then be used for various downstream tasks (Middle) and deployed in three real-world experiments (Right).
  • Figure 2: Comparison of tactile data generation approaches. We evaluate whether each method produces visually realistic images, generates varied outputs from a single input (rather than collapsing to a mean image), and allows control via physical inputs. We compare ControlTac with three other directions: Text2Tactile [texttoucher, binding], Visual2Tactile [visgel, tarf, touchinganerf], and Simulation [tacto, taxim, difftactile].
  • Figure 3: Illustration of our controllable tactile generation framework. a) The Force-Control component of ControlTac. We input the tactile image $\mathbf{x}$ without background $\mathbf{B}$ into the DiT, which is conditioned on the 3D force $\mathbf{\Delta F}$. b) The Position-Control component of ControlTac. We copy the DiT from the first stage and fine-tune it with ControlNet conditioned on the contact mask $\mathbf{c}$ to generate a realistic tactile image $\mathbf{y_B}$ conditioned on different forces and contact positions.
  • Figure 4: Qualitative Generation Results. The first column displays 3D previews of six objects, followed by the input tactile image (Ref. Image) in the second column and the Contact Mask in the third column. The fourth column shows the initial force (top) and target force (bottom). Subsequent columns depict the Ground Truth (G.T.) and results from ControlTac, the hybrid force-position conditional diffusion model (Hybrid), and the separate-control pipeline (Separate). In part A), we visualize the generated images for comparison; in part B), we visualize the error maps highlighting the differences from the ground-truth tactile image. Complete results and force-only generation results are shown in Fig. \ref{fig:error_map} and Fig. \ref{fig:only_force_gen}, respectively, in Appendix \ref{app:vis}.
  • Figure 5: Force estimation performance (MAE) across different quantities of real and generated data. The normal force range is 1–10 N.
  • ...and 6 more figures
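
The two control inputs named in the Figure 3 caption, a tactile image without its background and a contact mask, can be sketched with plain nested lists. This is only an illustration under stated assumptions: the caption does not say how the background is removed or how masks for new positions are produced, so pixel-wise subtraction and integer translation are guesses, and `remove_background` and `shift_mask` are hypothetical helper names.

```python
def remove_background(x, background):
    """Pixel-wise x - B for two equally sized 2D grids (lists of rows).

    Assumes "without background" means simple subtraction, which the
    caption does not confirm.
    """
    return [
        [px - bx for px, bx in zip(row_x, row_b)]
        for row_x, row_b in zip(x, background)
    ]


def shift_mask(mask, dx, dy):
    """Translate a binary contact mask by (dx, dy) pixels.

    Positive dx moves right, positive dy moves down. Mask pixels that
    would land outside the grid are dropped.
    """
    height, width = len(mask), len(mask[0])
    shifted = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if mask[r][c]:
                nr, nc = r + dy, c + dx
                if 0 <= nr < height and 0 <= nc < width:
                    shifted[nr][nc] = 1
    return shifted


# Tiny 2x3 example: subtract a flat background, then move the contact
# region one pixel right and one pixel down.
image = [[12, 10, 10], [10, 10, 10]]
flat_background = [[10, 10, 10], [10, 10, 10]]
contact = [[1, 0, 0], [0, 0, 0]]

foreground = remove_background(image, flat_background)
moved = shift_mask(contact, dx=1, dy=1)
```

In a real pipeline these arrays would be image tensors, and the shifted mask would serve as the ControlNet condition while the background-free image and the force go to the DiT.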