X-Capture: An Open-Source Portable Device for Multi-Sensory Learning
Samuel Clarke, Suzannah Wistreich, Yanjie Ze, Jiajun Wu
TL;DR
X-Capture is an open-source, portable device that jointly captures RGBD images, tactile readings, and impact audio from objects in the wild, enabling object-centric multi-sensory learning at scale. The authors assemble a hardware stack costing under $1,000 and curate a dataset of 3,000 points on 500 real objects across nine environments, with synchronized measurements and post-processed point clouds. Experiments on cross-sensory retrieval, localization, generation, and pretraining transfer show that cross-sensory representations improve with more modalities and with larger amounts of real-world data, and that pretraining on X-Capture helps bridge domain gaps to external datasets. The work demonstrates practical utility for pretraining and fine-tuning multi-modal encoders and highlights the potential of scalable, real-world multi-sensory learning for object understanding in AI systems.
Abstract
Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset of 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and variety. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for pretraining and fine-tuning multi-modal representations on object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.
