The Sound of Water: Inferring Physical Properties from Pouring Liquids

Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek, Andrew Zisserman

TL;DR

The work investigates how static and dynamic physical properties can be inferred from the sound of pouring liquids. It builds a physics-informed, two-stage system that first detects the fundamental frequency (pitch) from audio and then recovers properties such as air-column length, container dimensions, flow rate, and time to fill. The pitch detector is pre-trained on synthetic data and fine-tuned on real data with visual co-supervision, so no manual pitch labels are needed for real recordings. A new large pouring dataset, Sound of Water 50, enables controlled study and cross-domain evaluation; reported results include pitch estimation, property recovery, container shape classification, and liquid-weight estimation. The findings are relevant to multisensory perception in robotic pouring and carry over to in-the-wild videos, while also exposing limits of generalization and directions for future physics-based audio-visual learning.

Abstract

We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate and the time to fill. To this end, we: (i) show in theory that these properties can be determined from the fundamental frequency (pitch); (ii) train a pitch detection model with supervision from simulated data and visual data with a physics-inspired objective; (iii) introduce a new large dataset of real pouring videos for a systematic study; (iv) show that the trained model can indeed infer these physical properties for real data; and finally, (v) we demonstrate strong generalization to various container shapes, other datasets, and in-the-wild YouTube videos. Our work presents a keen understanding of a narrow yet rich problem at the intersection of acoustics, physics, and learning. It opens up applications to enhance multisensory perception in robotic pouring.
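Step (i) can be illustrated for the simplest case. The sketch below is not taken from the paper's derivation. It assumes a cylindrical container of radius $R$, the quarter-wavelength relation for axial resonance quoted in the Figure 2 caption, a constant volumetric flow rate, and the standard relation $f = c/\lambda$ between frequency and wavelength. The symbols $c$ (speed of sound), $f(t)$ (pitch), $Q$ (flow rate) and $T_{\text{fill}}$ (time to fill) are notation introduced here for illustration.

```latex
\begin{align}
f(t) &= \frac{c}{\lambda(t)} = \frac{c}{4\,\bigl(l(t) + \beta R\bigr)}
  && \text{(pitch from air-column length)} \\
l(t) &= \frac{c}{4 f(t)} - \beta R
  && \text{(air-column length from pitch)} \\
Q &= -\pi R^{2}\,\frac{\mathrm{d}l}{\mathrm{d}t}
  && \text{(flow rate, cylinder only)} \\
T_{\text{fill}} &= t + \frac{\pi R^{2}\, l(t)}{Q}
  && \text{(time at which } l = 0 \text{, constant } Q\text{)}
\end{align}
```

Under these assumptions a rising pitch maps to a shrinking air column, its rate of change gives the flow rate, and extrapolating to $l = 0$ gives the time to fill. Non-cylindrical containers would need a different volume model.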

Paper Structure

This paper contains 48 sections, 10 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: Overview of the problem and approach. We train a pitch detector without any manual supervision and rely on physics to estimate physical properties merely from the sound of water.
  • Figure 2: Demonstration of resonance in liquid pouring. As liquid is poured into the container shown in (a), of height $H$ and radius $R$, a sound with an increasing pitch (fundamental frequency) and some (odd) harmonics is observed on the spectrogram shown in (b). Two kinds of resonance are observed: axial (fundamental shown as blue circles in (d), first harmonic as green crosses) and radial (fundamental shown as yellow squares). The wavelength of the axial resonance (shown in (c)), which is inversely proportional to its frequency, is a function of the length of the air column $l(t)$: $\lambda(t)/4 = l(t) + \beta R$. Interestingly, the high-intensity blob around 3s is likely due to the mixture of pitch from both kinds of resonance.
  • Figure 3: Model architecture and training. (a) The audio network is based on wav2vec2, repurposed for pitch detection. (b) The video network is based on DINO, repurposed to operate on image sequences to detect the length of the air column and the container radius (up to a scale factor). (c) The audio network is pre-trained on synthetic samples and then fine-tuned on real samples using physics-inspired co-supervision from the video.
  • Figure 4: Samples of simulated pouring sounds. Our simulator takes in (i) a real sample from the train set as a condition and (ii) a random pitch profile, and generates a synthetic waveform that resembles the sound of pouring liquid in a cylindrical container. More samples are shown in the appendix (audio network subsection).
  • Figure 5: Examples of containers used in the Sound of Water 50 dataset. The dataset contains videos of pouring liquids in containers with diverse shapes, materials, opacity and background environments. The train set has videos of pouring in transparent cylinder-like containers. Test set I shares the same set of containers but is distinct in terms of the videos. Test sets II and III have videos of pouring in entirely unseen containers.
  • ...and 15 more figures
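
The quarter-wavelength relation in the Figure 2 caption, $\lambda(t)/4 = l(t) + \beta R$, can be explored numerically. The following is a minimal sketch, not the authors' code. The speed of sound (343 m/s), the end-correction coefficient `BETA = 0.62`, the function names, and the container dimensions are all assumptions chosen for illustration, and the constant pouring rate is a simplification.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, air at roughly 20 C (assumed value)
BETA = 0.62             # end-correction coefficient (assumed value)


def axial_pitch(air_column_m, radius_m, c=SPEED_OF_SOUND, beta=BETA):
    """Axial fundamental (Hz) from lambda/4 = l + beta*R and f = c/lambda."""
    wavelength_m = 4.0 * (np.asarray(air_column_m, dtype=float) + beta * radius_m)
    return c / wavelength_m


def air_column_from_pitch(pitch_hz, radius_m, c=SPEED_OF_SOUND, beta=BETA):
    """Invert the relation above: l = c / (4 f) - beta*R."""
    return c / (4.0 * np.asarray(pitch_hz, dtype=float)) - beta * radius_m


# Hypothetical cylinder filled at a constant rate (all numbers are made up).
height_m, radius_m, fill_time_s = 0.20, 0.03, 10.0
t = np.linspace(0.0, fill_time_s, 101)
air_column = height_m * (1.0 - t / fill_time_s)  # shrinks linearly to zero

pitch = axial_pitch(air_column, radius_m)        # rises as the container fills
recovered = air_column_from_pitch(pitch, radius_m)

# A straight-line fit to the recovered air column gives the fill time
# (where the line crosses zero) and, for a cylinder, the flow rate.
slope, intercept = np.polyfit(t, recovered, 1)
est_fill_time_s = -intercept / slope
est_flow_m3_per_s = -slope * np.pi * radius_m**2

print(f"pitch range: {pitch[0]:.0f} Hz to {pitch[-1]:.0f} Hz")
print(f"estimated time to fill: {est_fill_time_s:.2f} s")
print(f"estimated flow rate: {est_flow_m3_per_s * 1e6:.1f} mL/s")
```

Here the pitch is generated from a known air-column profile and then inverted, so the recovery is exact. With a real recording, `pitch` would instead come from a pitch detector and the straight-line fit would absorb estimation noise.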