6DoF SELD: Sound Event Localization and Detection Using Microphones and Motion Tracking Sensors on self-motioning human

Masahiro Yasuda, Shoichiro Saito, Akira Nakayama, Noboru Harada

TL;DR

A multi-modal SELD system that jointly uses audio and motion tracking sensor signals, with a mechanism that extracts acoustic features conditioned on the sensor signals, improves SELD performance in experiments with a wearable microphone array on a moving human.

Abstract

We aim to perform sound event localization and detection (SELD) using wearable equipment for a moving human, such as a pedestrian. Conventional SELD tasks have dealt only with microphone arrays located in static positions. However, self-motion with three rotational and three translational degrees of freedom (6DoF) must be considered for wearable microphone arrays. A system trained only on a dataset recorded with microphone arrays in a fixed position would be unable to adapt to the fast relative motion of sound events caused by self-motion, which degrades SELD performance. To address this, we designed the 6DoF SELD Dataset for wearable systems, the first SELD dataset that considers the self-motion of microphones. Furthermore, we proposed a multi-modal SELD system that jointly utilizes audio and motion tracking sensor signals. These sensor signals are expected to help the system find useful acoustic cues for SELD on the basis of the current self-motion state. Experimental results on our dataset show that the proposed method, which uses a mechanism to extract acoustic features conditioned on sensor signals, effectively improves SELD performance.

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Conventional and our problem settings of SELD
  • Figure 2: Recording setup and equipment configuration for 6DoF SELD Dataset. In (a) and (e), red circles indicate the range of movement of the subject and blue circles indicate the range of the sound source position.
  • Figure 3: (a) MMTM for excitation of acoustic features on the basis of sensor signals. $C_A$ and $C_S$ are the numbers of channels of the acoustic and sensor features, respectively, $F$ is the number of dimensions of the acoustic features, and $\odot$ denotes the Hadamard product. (b) Network architecture of the proposed multi-modal SELD system. "AmpSpec" denotes the amplitude of the spectrogram.
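
The excitation described in the Figure 3(a) caption can be illustrated with a minimal NumPy sketch. Only the idea of gating acoustic channels with a Hadamard product comes from the caption. The tensor shapes, the squeeze by averaging, the ReLU joint layer, the sigmoid gate, and the names `sensor_conditioned_excitation`, `w_joint`, and `w_gate` are assumptions made for illustration, and the layer in the paper may differ.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def sensor_conditioned_excitation(acoustic, sensor, w_joint, w_gate):
    """Gate acoustic channels using a joint audio/sensor summary.

    acoustic: (C_A, T, F) acoustic feature map (assumed layout)
    sensor:   (C_S, T) sensor feature map (assumed layout)
    w_joint:  (C_Z, C_A + C_S) hypothetical joint projection weights
    w_gate:   (C_A, C_Z) hypothetical projection to per-channel gates
    """
    # Squeeze: summarize each channel by its mean over time (and frequency).
    squeezed_a = acoustic.mean(axis=(1, 2))  # shape (C_A,)
    squeezed_s = sensor.mean(axis=1)         # shape (C_S,)

    # Joint representation of both modalities, with a ReLU nonlinearity.
    joint = np.maximum(w_joint @ np.concatenate([squeezed_a, squeezed_s]), 0.0)

    # One gate per acoustic channel, each in the open interval (0, 1).
    gate = sigmoid(w_gate @ joint)           # shape (C_A,)

    # Excitation: Hadamard product, broadcasting the gate over time and frequency.
    return acoustic * gate[:, None, None], gate


rng = np.random.default_rng(0)
C_A, C_S, C_Z, T, F = 4, 3, 5, 10, 8
acoustic = rng.standard_normal((C_A, T, F))
sensor = rng.standard_normal((C_S, T))
w_joint = rng.standard_normal((C_Z, C_A + C_S))
w_gate = rng.standard_normal((C_A, C_Z))

excited, gate = sensor_conditioned_excitation(acoustic, sensor, w_joint, w_gate)
```

In this sketch the gate depends on the sensor summary, so the same acoustic input is rescaled differently under different self-motion states, which is the conditioning behavior the caption refers to.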