SoundCam: A Dataset for Finding Humans Using Room Acoustics

Mason Wang, Samuel Clarke, Jui-Hsien Wang, Ruohan Gao, Jiajun Wu

TL;DR

SoundCam introduces the largest real-world, multi-room dataset of room impulse responses (RIRs) with humans present, enabling learning-based human localization, identification, and detection from acoustic signals. The dataset comprises 5,000 10-channel RIRs and 2,000 10-channel music recordings collected in three rooms, with precise pose annotations obtained from cameras and synchronized hardware. A suite of baselines shows that multichannel deep audio models (notably a multichannel VGGish) achieve localization error of about 30 cm and identification accuracy in the low-to-mid 80% range under controlled conditions, but generalization to unseen humans and room layouts remains challenging. The work highlights both the potential and the current limitations of acoustic sensing for indoor human tracking, and releases SoundCam under an open license to spur future research in privacy-conscious, audio-based indoor sensing.

Abstract

A room's acoustic properties are a product of the room's geometry, the objects within the room, and their specific positions. A room's acoustic properties can be characterized by its impulse response (RIR) between a source and listener location, or roughly inferred from recordings of natural signals present in the room. Variations in the positions of objects in a room can effect measurable changes in the room's acoustic properties, as characterized by the RIR. Existing datasets of RIRs either do not systematically vary positions of objects in an environment, or they consist of only simulated RIRs. We present SoundCam, the largest dataset of unique RIRs from in-the-wild rooms publicly released to date. It includes 5,000 10-channel real-world measurements of room impulse responses and 2,000 10-channel recordings of music in three different rooms, including a controlled acoustic lab, an in-the-wild living room, and a conference room, with different humans in positions throughout each room. We show that these measurements can be used for interesting tasks, such as detecting and identifying humans, and tracking their positions.
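
As a rough illustration of what it means for an RIR to characterize a room between a source and a listener: under the usual linear, time-invariant assumption, the signal at each microphone is the source signal convolved with that microphone's RIR. The sketch below is not the authors' code; the function names and the toy two-channel values are hypothetical and chosen only to show the idea for a multichannel setup.

```python
def convolve(signal, rir):
    """Discrete linear convolution of a dry signal with one RIR channel."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def render_multichannel(signal, rirs):
    """Apply one RIR per microphone channel to the same source signal."""
    return [convolve(signal, rir) for rir in rirs]


# Toy 2-channel example (illustrative values, not taken from SoundCam).
dry = [1.0, 0.5]
rirs = [
    [1.0, 0.0, 0.25],  # direct path plus one weak, delayed reflection
    [0.5, 0.0, 0.0],   # attenuated direct path, e.g. partly blocked by a person
]
wet = render_multichannel(dry, rirs)
print(wet)
```

A person moving through the room changes each channel's RIR, and therefore changes what every microphone records for the same source; that difference is the signal the dataset's tasks rely on.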

Paper Structure (50 sections, 1 equation, 11 figures, 14 tables)

Figures (11)

  • Figure 1: Spectrograms visualizing the RIRs from the Treated Room (left column) and real Living Room (right column), either empty (top row) or with humans standing near the loudspeaker sound source (bottom row). The RIRs within each column are from the same speaker and microphone position. While the human's obstructing the direct path noticeably attenuates the intensity and duration of the RIR as measured by the microphone in each room, the Living Room has much stronger indirect paths for the sound to reach the microphone through reflections and thus shows less obvious effects.
  • Figure 2: Images and visualizations from the real living room. (Left) A photo of the room. (Middle) An aerial view of a 3D scan of the room. (Right) A visualization of the microphone, speaker, and human positions in our dataset. See Appendix \ref{sec:room_details} for visualizations of the other rooms.
  • Figure 3: (Left) An aerial view of a 3D scan of the Treated Room. (Right) A visualization of the microphone, speaker, and unique human positions within the Treated Room for our human identification subset.
  • Figure 4: Images from the Treated Room in its empty configuration.
  • Figure 5: Images from the Treated Room in its configuration with fabric panels.
  • ...and 6 more figures