CAVER: Curious Audiovisual Exploring Robot

Luca Macesanu, Boueny Folefack, Samik Singh, Ruchira Ray, Ben Abbatematteo, Roberto Martín-Martín

TL;DR

The paper tackles the challenge of enabling robots to jointly learn visual appearance and sound properties of objects during interaction. It proposes CAVER, a robot that autonomously builds a growing KNN-based audiovisual representation using a novel 3D-printed impact tool and a curiosity-driven exploration policy that targets visually uncertain regions. The approach supports bi-directional retrieval for audio-to-visual and visual-to-audio tasks and enables downstream capabilities such as audio prediction, material classification, and audio-based imitation without large external datasets. Empirical results across multiple household environments show faster audio-property learning, strong material classification (up to 87%), notable musical imitation (66%), and competitive sound-based manipulation inference, highlighting the practical potential of curiosity-guided audiovisual learning for robust robot perception and manipulation.
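
To make the idea of a growing, bi-directional KNN store concrete, here is a minimal sketch. It is not the authors' implementation: the class name `PairedKNN`, the use of Euclidean distance, and the choice to average the paired features of the k nearest neighbors are all assumptions made for illustration.

```python
import math


class PairedKNN:
    """Hypothetical paired store of (visual, audio) feature vectors."""

    def __init__(self, k=1):
        self.k = k
        self.visual = []  # visual feature vectors, one per interaction
        self.audio = []   # audio feature vectors, same order as self.visual

    def add(self, visual_feat, audio_feat):
        # Growing the representation is just appending a new pair.
        self.visual.append(list(visual_feat))
        self.audio.append(list(audio_feat))

    def _query(self, query, keys, values):
        if not keys:
            raise ValueError("representation is empty")
        # Indices of the k stored keys closest to the query (Euclidean, assumed).
        nearest = sorted(range(len(keys)),
                         key=lambda i: math.dist(query, keys[i]))[: self.k]
        dim = len(values[0])
        # Assumed aggregation: mean of the paired values of the neighbors.
        return [sum(values[i][d] for i in nearest) / len(nearest)
                for d in range(dim)]

    def visual_to_audio(self, visual_feat):
        return self._query(visual_feat, self.visual, self.audio)

    def audio_to_visual(self, audio_feat):
        return self._query(audio_feat, self.audio, self.visual)
```

Because both directions share the same stored pairs, adding one interaction improves visual-to-audio and audio-to-visual retrieval at the same time, with no retraining step.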

Abstract

Multimodal audiovisual perception can enable new avenues for robotic manipulation, from better material classification to the imitation of demonstrations for which only audio signals are available (e.g., playing a tune by ear). However, to unlock such multimodal potential, robots need to learn the correlations between an object's visual appearance and the sound it generates when they interact with it. Such an active sensorimotor experience requires new interaction capabilities, representations, and exploration methods to guide the robot in efficiently building increasingly rich audiovisual knowledge. In this work, we present CAVER, a novel robot that builds and utilizes rich audiovisual representations of objects. CAVER includes three novel contributions: 1) a novel 3D printed end-effector, attachable to parallel grippers, that excites objects' audio responses, 2) an audiovisual representation that combines local and global appearance information with sound features, and 3) an exploration algorithm that uses and builds the audiovisual representation in a curiosity-driven manner that prioritizes interacting with high uncertainty objects to obtain good coverage of surprising audio with fewer interactions. We demonstrate that CAVER builds rich representations in different scenarios more efficiently than several exploration baselines, and that the learned audiovisual representation leads to significant improvements in material classification and the imitation of audio-only human demonstrations. https://caver-bot.github.io/
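
The curiosity-driven exploration described above can be sketched as a greedy loop that always strikes the candidate whose visual features are farthest from anything already sampled. This is an illustrative sketch under stated assumptions, not the paper's algorithm: uncertainty is taken to be the Euclidean distance to the nearest stored visual feature, and `strike` is a hypothetical callable standing in for the robot hitting a point and returning audio features.

```python
import math


def uncertainty(candidate, visual_memory):
    """Assumed proxy: distance to the nearest previously sampled visual feature."""
    if not visual_memory:
        return math.inf  # nothing is known yet, so everything is maximally uncertain
    return min(math.dist(candidate, seen) for seen in visual_memory)


def select_most_uncertain(candidates, visual_memory):
    """Return (index, score) of the candidate with the highest uncertainty."""
    scores = [uncertainty(c, visual_memory) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]


def explore(candidates, strike, budget):
    """Greedily sample up to `budget` candidates, most uncertain first.

    `strike(index)` is a hypothetical stand-in for the physical interaction;
    it must return the audio feature vector recorded at that candidate.
    """
    visual_memory, audio_memory, order = [], [], []
    remaining = list(range(len(candidates)))
    for _ in range(min(budget, len(remaining))):
        feats = [candidates[i] for i in remaining]
        j, _ = select_most_uncertain(feats, visual_memory)
        idx = remaining.pop(j)
        # Store the new (visual, audio) pair so later scores account for it.
        visual_memory.append(candidates[idx])
        audio_memory.append(strike(idx))
        order.append(idx)
    return order, visual_memory, audio_memory
```

With candidates clustered around two visually distinct regions, this loop visits one point in each region before returning to near-duplicates, which is the intuition behind covering surprising audio with fewer interactions.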

Paper Structure

This paper contains 14 sections, 3 equations, 6 figures.

Figures (6)

  • Figure 1: Curiously Building and Exploiting an Audiovisual Representation with CAVER. CAVER incrementally builds a KNN-based audiovisual representation capable of audio-to-visual and visual-to-audio prediction. To explore an environment, CAVER considers many candidate interaction points (dots), and ranks their uncertainty by comparing their visual features to those of prior samples. CAVER then collects an audio sample of the most uncertain point (red) using our novel impact tool. The audio and visual features are added as a pair to the audiovisual representation in an intrinsically motivated process that results in efficient interactive exploration.
  • Figure 2: Overview of CAVER's audiovisual representation and curious exploration. CAVER curiously and efficiently learns correlations between object visual appearance and acoustic properties. Given candidate hitting points, CAVER uses a KNN model with fine-tuned features from foundation vision models to predict corresponding impact sounds. To choose informative interaction points, CAVER selects the most uncertain candidate hitting location, using the distance in visual feature space between a candidate (red) and all prior samples (green) as a proxy for uncertainty. After sampling the most uncertain candidate, the corresponding visual and audio embeddings are paired as two sides of the audiovisual representation. This can best be thought of as a bi-directional mapping: audio features can be used to predict visual features, visual features can be used to predict audio features, and concatenating the embeddings gives an informative multimodal representation of that sample point.
  • Figure 3: Robotic Impact Tool. To generate consistent impact sounds, we designed a 3D-printed, spring-loaded impact tool that attaches to the robot's gripper. When the gripper closes, a cam-follower retracts the rod, building tension in the spring until the cam slips and the stored energy in the spring drives the metal rod forward, impacting the object. A directional microphone is mounted near the scene camera to record the resulting sound.
  • Figure 4: Curious Exploration for Audio Prediction. The top row (a) visualizes the objects that comprise each environment. The bottom row (b) shows the average audio prediction error on a held-out test set, measured as the mel-cepstral distortion, as a function of the number of interaction points sampled by the robot. Vertical lines indicate scenes within an environment in which a subset of the objects appear. Each environment is plotted individually (left to right: kitchen, garage, and playroom). CAVER consistently achieves higher prediction accuracy more quickly than the naive exploration baselines via its curious exploration and audiovisual representation.
  • Figure 5: Material Classification Results. We measure the efficacy of CAVER's unified audiovisual embeddings for material classification compared to visual embeddings alone, audio embeddings alone, and a random baseline. The plots show the class-balanced material classification accuracy on a held-out test set as a function of the number of interaction points sampled by the robot, aggregated over 20 runs. Performance is evaluated in the kitchen, garage, and playroom environments, respectively. Incorporating both audio and visual inputs as in CAVER's unified embedding is necessary to achieve strong material classification performance given ambiguous stimuli.
  • ...and 1 more figure
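
The Figure 4 caption reports audio prediction error as mel-cepstral distortion (MCD). The listing does not give the exact variant used, so the following is a sketch of one widely used definition, averaged over frames: MCD = (10 / ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²). Assumptions here: frames are already time-aligned, inputs are lists of mel-cepstral coefficient vectors, and the 0th (energy) coefficient is skipped by default; the function name is hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, pred_frames, skip_c0=True):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients."""
    if not ref_frames or len(ref_frames) != len(pred_frames):
        raise ValueError("need the same non-zero number of frames in both inputs")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    start = 1 if skip_c0 else 0  # optionally drop the energy coefficient
    total = 0.0
    for ref, pred in zip(ref_frames, pred_frames):
        sq_err = sum((r - p) ** 2 for r, p in zip(ref[start:], pred[start:]))
        total += scale * math.sqrt(sq_err)
    return total / len(ref_frames)
```

Lower values mean the predicted impact sound is spectrally closer to the recorded one, so a curve that drops faster as interaction points accumulate indicates more efficient exploration.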