A multi-device dataset for urban acoustic scene classification
Annamaria Mesaros, Toni Heittola, Tuomas Virtanen
TL;DR
The paper introduces the DCASE 2018 Task 1 framework and the TUT Urban Acoustic Scenes 2018 dataset, which covers 10 urban acoustic scenes recorded in six European cities with multiple devices so that channel mismatch can be studied. A CNN-based baseline using log-mel features provides a reference benchmark under matched conditions (subtask A), shows how performance declines under device mismatch (subtask B), and leaves room for transfer-learning exploration (subtask C). The key contributions are the publicly released multi-device dataset and the baseline system, which together support research on robustness, device variance, and cross-domain generalization in urban acoustic scene classification. By capturing real-world recording variability and providing standardized partitions, the work also enables fair benchmarking with practical deployment in mind.
Abstract
This paper introduces the acoustic scene classification task of the DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system on the task. As in previous years of the challenge, the task is defined as the classification of short audio samples into one of a set of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities; it therefore has higher acoustic variability than the datasets previously used for this task. In addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system, which consists of a convolutional neural network, and report its performance in the subtasks using the recommended cross-validation setup.
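Because the dataset mixes high-quality and mobile-device recordings, it can help to report accuracy separately for each recording device when evaluating a closed-set classifier. The following is a minimal sketch of that bookkeeping, not the challenge's evaluation code: the function name `accuracy_by_device`, the device tags `high_quality` and `mobile`, the scene labels, and the predictions are all hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict


def accuracy_by_device(records):
    """Return the fraction of correct predictions for each device.

    `records` is an iterable of (device, true_scene, predicted_scene)
    tuples. The tuple layout is an assumption made for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for device, truth, predicted in records:
        total[device] += 1
        correct[device] += int(truth == predicted)
    return {device: correct[device] / total[device] for device in total}


# Hypothetical predictions: device tags and scene labels are placeholders,
# not values taken from the dataset.
records = [
    ("high_quality", "park", "park"),
    ("high_quality", "tram", "tram"),
    ("high_quality", "airport", "airport"),
    ("high_quality", "bus", "tram"),
    ("mobile", "park", "park"),
    ("mobile", "tram", "bus"),
    ("mobile", "airport", "park"),
    ("mobile", "bus", "tram"),
]

result = accuracy_by_device(records)
print(result)
```

Comparing the per-device values side by side is one simple way to make a gap between matched and mismatched recording conditions visible; the numbers produced here reflect only the made-up records above.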
