CAVER: Curious Audiovisual Exploring Robot
Luca Macesanu, Boueny Folefack, Samik Singh, Ruchira Ray, Ben Abbatematteo, Roberto Martín-Martín
TL;DR
The paper tackles the challenge of enabling robots to jointly learn visual appearance and sound properties of objects during interaction. It proposes CAVER, a robot that autonomously builds a growing KNN-based audiovisual representation using a novel 3D-printed impact tool and a curiosity-driven exploration policy that targets visually uncertain regions. The approach supports bi-directional retrieval for audio-to-visual and visual-to-audio tasks and enables downstream capabilities such as audio prediction, material classification, and audio-based imitation without large external datasets. Empirical results across multiple household environments show faster audio-property learning, strong material classification (up to 87%), notable musical imitation (66%), and competitive sound-based manipulation inference, highlighting the practical potential of curiosity-guided audiovisual learning for robust robot perception and manipulation.
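To make the idea of a growing KNN-based audiovisual representation with bi-directional retrieval concrete, here is a minimal sketch. It is not CAVER's implementation: the class name `AudioVisualMemory`, the plain Euclidean distance, the toy two-dimensional features, and the averaging of neighbors are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


@dataclass
class AudioVisualMemory:
    """Hypothetical growing store of paired (visual, audio) features."""
    visual: list = field(default_factory=list)
    audio: list = field(default_factory=list)

    def add(self, visual_feat, audio_feat):
        # Each interaction appends one paired sample; nothing is retrained.
        self.visual.append(list(visual_feat))
        self.audio.append(list(audio_feat))

    def _knn(self, keys, query, k):
        if not keys:
            raise ValueError("memory is empty")
        order = sorted(range(len(keys)), key=lambda i: euclidean(keys[i], query))
        return order[:k]

    def predict_audio(self, visual_query, k=3):
        """Visual-to-audio: average the audio of the k nearest visual neighbors."""
        idx = self._knn(self.visual, visual_query, k)
        return mean_vector([self.audio[i] for i in idx])

    def retrieve_visual(self, audio_query, k=1):
        """Audio-to-visual: return the visual features of the k nearest sounds."""
        idx = self._knn(self.audio, audio_query, k)
        return [self.visual[i] for i in idx]


memory = AudioVisualMemory()
memory.add([0.0, 0.0], [1.0, 0.0])   # toy sample: dull-sounding surface
memory.add([0.1, 0.0], [0.8, 0.2])   # visually similar, similar sound
memory.add([5.0, 5.0], [0.0, 1.0])   # visually distinct, ringing sound

predicted = memory.predict_audio([0.05, 0.0], k=2)
matches = memory.retrieve_visual([0.0, 0.9], k=1)
```

Because the same paired store is queried from either side, adding a sample immediately improves both retrieval directions, which is one reason a nearest-neighbor memory suits incremental exploration.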
Abstract
Multimodal audiovisual perception can enable new avenues for robotic manipulation, from better material classification to the imitation of demonstrations for which only audio signals are available (e.g., playing a tune by ear). However, to unlock such multimodal potential, robots need to learn the correlations between an object's visual appearance and the sound it generates when they interact with it. Such active sensorimotor experience requires new interaction capabilities, representations, and exploration methods to guide the robot in efficiently building increasingly rich audiovisual knowledge. In this work, we present CAVER, a novel robot that builds and utilizes rich audiovisual representations of objects. CAVER makes three contributions: 1) a 3D-printed end-effector, attachable to parallel grippers, that excites objects' audio responses, 2) an audiovisual representation that combines local and global appearance information with sound features, and 3) an exploration algorithm that both uses and extends the audiovisual representation in a curiosity-driven manner, prioritizing interaction with high-uncertainty objects to obtain good coverage of surprising audio with fewer interactions. We demonstrate that CAVER builds rich representations in different scenarios more efficiently than several exploration baselines, and that the learned audiovisual representation leads to significant improvements in material classification and the imitation of audio-only human demonstrations. https://caver-bot.github.io/
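The curiosity-driven exploration described in the abstract can be illustrated with a small sketch. The uncertainty measure used here (mean distance to the nearest already-explored visual features), the function names, and the `strike_and_record` callback are hypothetical stand-ins, not the paper's actual algorithm or interface.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def visual_uncertainty(candidate, seen_visual, k=3):
    """Assumed proxy for uncertainty: mean distance to the k nearest
    visual features the robot has already interacted with."""
    if not seen_visual:
        return math.inf  # nothing explored yet, so everything is novel
    nearest = sorted(distance(candidate, v) for v in seen_visual)[:k]
    return sum(nearest) / len(nearest)


def explore(candidates, strike_and_record, budget, k=3):
    """Greedy loop: repeatedly interact with the most uncertain candidate.

    candidates: dict mapping an object name to its visual feature vector.
    strike_and_record: hypothetical callback that taps the named object
        and returns an audio feature vector.
    budget: maximum number of interactions.
    """
    remaining = dict(candidates)
    seen_visual, seen_audio, order = [], [], []
    for _ in range(min(budget, len(remaining))):
        name = max(
            remaining,
            key=lambda n: visual_uncertainty(remaining[n], seen_visual, k),
        )
        feat = remaining.pop(name)
        seen_visual.append(feat)
        seen_audio.append(strike_and_record(name))
        order.append(name)
    return order, seen_visual, seen_audio


# Toy scene: two visually similar mugs and one visually distinct pan.
scene = {"mug_a": [0.0, 0.0], "mug_b": [0.1, 0.0], "pan": [5.0, 5.0]}
order, _, _ = explore(scene, lambda name: [float(len(name))], budget=2)
```

With a budget of two interactions, the loop taps one mug and then the pan rather than both mugs, since the second mug looks nearly identical to something already explored. That is the intuition behind covering diverse sounds with fewer interactions.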
