X-Capture: An Open-Source Portable Device for Multi-Sensory Learning

Samuel Clarke, Suzannah Wistreich, Yanjie Ze, Jiajun Wu

TL;DR

X-Capture is an open-source, portable device that jointly captures RGBD, tactile, and impact audio data from objects in the wild, enabling object-centric multi-sensory learning at scale. The authors assemble a hardware stack costing under $1,000 and curate a dataset of 3,000 points on 500 real objects across nine environments, with synchronized measurements and post-processed point clouds. Experiments on cross-sensory retrieval, localization, generation, and pretraining transfer show that cross-sensory representations benefit from more modalities and from larger, real-world data, and that pretraining on X-Capture helps bridge domain gaps to external datasets. The work demonstrates practical utility for pretraining and fine-tuning multi-modal encoders and highlights the potential of scalable, real-world multi-sensory learning for improved object understanding in AI systems.

Abstract

Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset of 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and variety. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for both pretraining and fine-tuning multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.

Paper Structure

This paper contains 36 sections, 11 figures, 8 tables.

Figures (11)

  • Figure 1: X-Capture for multi-sensory data capture. (Left) The user captures tactile data from a vase in a living room. (Right) The sensor readings for each modality from the same probed point on the vase, as well as a visualization of the hammer impulse and 3D pose vectors for the image and tactile captures, shown in blue and red, respectively.
  • Figure 2: Exploded view of X-Capture. The device rigidly constrains all sensor assemblies into fixed relative poses on a compact chassis with an ergonomic grip. Wires and circuitry are not shown.
  • Figure 3: Example multi-sensory data points from the X-Capture dataset. Each row shows aligned multi-sensory data captured from a single point on an object in a distinct natural environment: (from left) RGB image centered on the point, depth image, impact audio spectrogram, and tactile image. (Objects, from top: Cat Brush, Insulated Steel Cup, Glass Storage Bowl, Computer Speaker, and Sheet Metal Container.)
  • Figure 4: Comparing test retrieval performance of our cross-modal encoders trained with varying quantities of objects, with each plot grouping results by the query modality used for retrieval. The right-most plot shows an average across all modality combinations.
  • Figure 5: Results of using Shap-E (Jun and Nichol, 2023) to generate 3D neural radiance fields and Stable Diffusion (Rombach et al., 2022) to generate images from the outputs of multimodal encoders trained on our data to align with CLIP features. The three left columns show the RGB images, audio spectrograms, and tactile images input to their respective encoders. The next three columns show the neural radiance fields generated by passing the encoder outputs for these RGB, audio, and tactile inputs, respectively, to Shap-E. The last three columns similarly show the images generated by passing the encoder outputs for the RGB, audio, and tactile inputs, respectively, to Stable Diffusion.
  • ...and 6 more figures