CLAMP: Crowdsourcing a LArge-scale in-the-wild haptic dataset with an open-source device for Multimodal robot Perception
Pranav N. Thakkar, Shubhangi Sinha, Karan Baijal, Yuhan Bian, Leah Lackey, Ben Dodson, Heisen Kong, Jueun Kwon, Amber Li, Yifei Hu, Alexios Rekoutis, Tom Silver, Tapomayukh Bhattacharjee
TL;DR
CLAMP addresses material and compliance recognition in unstructured environments by introducing a low-cost haptic data-collection device for crowdsourcing, a large multimodal dataset, and a visuo-haptic perception model. The CLAMP dataset comprises 12.3 million samples from 5 haptic modalities plus vision and language, spanning 5357 household objects, collected by 41 non-expert users with 16 devices, and labeled with 16 material categories. The CLAMP model fuses an InceptionTime-based haptic encoder with a GPT-4o visual encoder and is trained with the loss $\mathcal{L} = \mathcal{L}_{WCE} + \lambda_{KL} \mathcal{L}_{KL}(\mathcal{V} \,\|\, \mathcal{P})$. It outperforms vision-only baselines, transfers across three robot embodiments, and is demonstrated on real-world robot tasks. The results indicate that large-scale, in-the-wild haptic data can enable robust, generalizable visuo-haptic perception for manipulation. The study also discusses hardware limitations and future directions, including additional sensors and end-to-end visuo-haptic models.
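To make the structure of the loss in the TL;DR concrete, here is a minimal sketch in plain Python. It rests on several assumptions that this summary does not confirm: that $\mathcal{L}_{WCE}$ is a class-weighted cross-entropy, that $\mathcal{V}$ is a class distribution from the visual branch, and that $\mathcal{P}$ is the model's predicted class distribution. The function names, class weights, and the value of $\lambda_{KL}$ are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(probs, label, class_weights, eps=1e-12):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    return -class_weights[label] * math.log(probs[label] + eps)


def kl_divergence(v, p, eps=1e-12):
    """KL(v || p) for two discrete distributions over the same classes."""
    return sum(
        vi * math.log((vi + eps) / (pi + eps))
        for vi, pi in zip(v, p)
        if vi > 0  # terms with v_i = 0 contribute nothing
    )


def combined_loss(logits, label, vision_probs, class_weights, lambda_kl):
    """L = L_WCE + lambda_KL * KL(V || P), under the assumptions stated above."""
    p = softmax(logits)
    wce = weighted_cross_entropy(p, label, class_weights)
    kl = kl_divergence(vision_probs, p)
    return wce + lambda_kl * kl


# Hypothetical three-class example (values are illustrative, not from the paper).
example_loss = combined_loss(
    logits=[2.0, 0.5, -1.0],
    label=0,
    vision_probs=[0.7, 0.2, 0.1],
    class_weights=[1.0, 2.0, 2.0],
    lambda_kl=0.5,
)
```

In this reading, the KL term pulls the predicted distribution toward the visual one, while the class weights counteract imbalance across material categories. Whether the paper uses the terms this way would need to be checked against the full text.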
Abstract
Robust robot manipulation in unstructured environments often requires understanding object properties that extend beyond geometry, such as material or compliance, which can be challenging to infer using vision alone. Multimodal haptic sensing provides a promising avenue for inferring such properties, yet progress has been constrained by the lack of large, diverse, and realistic haptic datasets. In this work, we introduce the CLAMP device, a low-cost (<\$200) sensorized reacher-grabber designed to collect large-scale, in-the-wild multimodal haptic data from non-expert users in everyday settings. We deployed 16 CLAMP devices to 41 participants, resulting in the CLAMP dataset, the largest open-source multimodal haptic dataset to date, comprising 12.3 million datapoints across 5357 household objects. Using this dataset, we train a haptic encoder that can infer the material and compliance properties of objects from multimodal haptic data. We leverage this encoder to create the CLAMP model, a visuo-haptic perception model for material recognition that generalizes to novel objects and three robot embodiments with minimal finetuning. We also demonstrate the effectiveness of our model in three real-world robot manipulation tasks: sorting recyclable and non-recyclable waste, retrieving objects from a cluttered bag, and distinguishing overripe from ripe bananas. Our results show that large-scale, in-the-wild haptic data collection can unlock new capabilities for generalizable robot manipulation. Website: https://emprise.cs.cornell.edu/clamp/
