MicCheck: Repurposing Off-the-Shelf Pin Microphones for Easy and Low-Cost Contact Sensing

Steven Oh, Tai Inui, Magdeline Kuan, Jia-Yeu Lin

TL;DR

This paper tackles the scarcity of tactile feedback in imitation learning for contact-rich robot manipulation by repurposing a low-cost Bluetooth pin microphone as an acoustic tactile sensor. MicCheck mounts the microphone on a 3D-printed gripper to produce informative Mel-spectrogram signals without extra electronics, enabling both perception and control. In material classification, it achieves 92.9% window-level accuracy across 10 classes; in manipulation, adding audio to an imitation-learning pipeline improves a picking-and-pouring task from 0.40 to 0.80 and enables additional contact-rich skills. The work demonstrates that inexpensive acoustic sensing can complement vision and proprioception to provide practical, deployable contact awareness, lowering barriers to multimodal learning in budget robotics.

Abstract

Robotic manipulation tasks are contact-rich, yet most imitation learning (IL) approaches rely primarily on vision, which struggles to capture stiffness, roughness, slip, and other fine interaction cues. Tactile signals can address this gap, but existing sensors often require expensive, delicate, or integration-heavy hardware. In this work, we introduce MicCheck, a plug-and-play acoustic sensing approach that repurposes an off-the-shelf Bluetooth pin microphone as a low-cost contact sensor. The microphone clips into a 3D-printed gripper insert and streams audio via a standard USB receiver, requiring no custom electronics or drivers. Despite its simplicity, the microphone provides signals informative enough for both perception and control. In material classification, it achieves 92.9% accuracy on a 10-class benchmark across four interaction types (tap, knock, slow press, drag). For manipulation, integrating the pin microphone into an IL pipeline built on open-source hardware improves the success rate on a picking-and-pouring task from 0.40 to 0.80 and enables reliable execution of contact-rich skills such as unplugging and sound-based sorting. Compared with high-resolution tactile sensors, pin microphones trade spatial detail for cost and ease of integration, offering a practical pathway for deploying acoustic contact sensing in low-cost robot setups.
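The TL;DR notes that the microphone audio is represented as Mel spectrograms. As a rough illustration of what the Mel scale does to the frequency axis, the sketch below converts between Hz and Mel and spaces band edges evenly in Mel. It assumes the widely used HTK-style formula; the source does not say which Mel variant, band count, or frequency range MicCheck uses, so the function names and the values in the usage note are hypothetical.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """HTK-style Hz -> Mel conversion (assumed variant, not confirmed by the source)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> List[float]:
    """Return n_bands + 2 edge frequencies in Hz, evenly spaced on the Mel scale.

    Adjacent triples of edges would define the triangular filters of a Mel
    filterbank; building the filters themselves is omitted here.
    """
    if n_bands < 1 or not 0.0 <= f_min < f_max:
        raise ValueError("need n_bands >= 1 and 0 <= f_min < f_max")
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]
```

For example, `mel_band_edges(0.0, 8000.0, 64)` (placeholder values) yields edges that are densely packed at low frequencies and increasingly sparse at high frequencies, which is the main effect of the Mel warping.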

Paper Structure

This paper contains 17 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of MicCheck. We repurpose low-cost pin microphones for contact sensing. We demonstrate this through two experiments: robotic manipulation and object classification.
  • Figure 2: Architecture of the contact-based object classification model. Single-channel Mel spectrograms from 4 types of mic–object interactions (tap, knock, slow press, drag) on 9 objects plus a "blank" no-contact class are fed into a compact 2D CNN comprising three Conv–BN–ReLU blocks, followed by global (adaptive) average pooling and a linear classifier. Models were trained with an 8:2 train/validation split (stratified by class) using cross-entropy loss and the Adam optimizer (learning rate $3\times10^{-4}$, batch size 32) for 2000 epochs, with the best checkpoint selected by highest validation accuracy. The blank class in the softmax acts as a rejection option: low-evidence windows that fall into it are treated as no contact.
  • Figure 3: Teleoperation setup. We employ the LeRobot SO-101 teleoperation setup with a modified gripper. An off-the-shelf commercial Bluetooth microphone is embedded in the gripper. The microphone is connected to a PC via a wireless USB receiver.
  • Figure 4: Action Chunking with Transformers (ACT) architecture. Training uses a conditional variational autoencoder: a transformer episode/style encoder produces a latent $\mathbf{z}$, and a transformer encoder–decoder predicts a chunk of future actions conditioned on observations and $\mathbf{z}$. At inference, the episode/style encoder is omitted and the policy generates actions in fixed-size chunks.
  • Figure 5: Normalized confusion matrix for the 10-class (9 objects + "blank") material classification task. The model shows strong diagonal dominance, with perfect accuracy for the blank class, glass cup, human skin, and steel tumbler. Most confusions occur between acoustically similar soft materials (e.g., plushie vs. leather case, notebook vs. leather case), reflecting challenges in distinguishing objects with overlapping frequency responses.
  • ...and 1 more figure
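
The Figure 2 caption describes a softmax over nine object classes plus a "blank" no-contact class, with the blank class used to reject low-evidence windows. A minimal sketch of that decision rule is given below, assuming the classifier emits one logit per class for each audio window. The label names, their order, and the function names are hypothetical placeholders; the source does not specify them.

```python
import math
from typing import List, Optional, Sequence

# Placeholder label order: nine object classes followed by the no-contact class.
LABELS: List[str] = [f"object_{i}" for i in range(9)] + ["blank"]
BLANK = "blank"


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_window(logits: Sequence[float]) -> Optional[str]:
    """Return the predicted object label for one window, or None if rejected.

    A window is rejected when the most probable class is "blank", i.e. the
    model finds no convincing evidence of contact in that window.
    """
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = LABELS[best]
    return None if label == BLANK else label
```

With this rule, a window whose largest logit belongs to an object class is labeled with that object, while a window dominated by the blank class returns `None` and can simply be skipped by downstream logic.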