ARAUS: A Large-Scale Dataset and Baseline Models of Affective Responses to Augmented Urban Soundscapes

Kenneth Ooi; Zhen-Ting Ong; Karn N. Watcharasupat; Bhan Lam; Joo Young Hong; Woon-Seng Gan

TL;DR

ARAUS provides the largest public dataset of affective responses to augmented urban soundscapes. It pairs 60 s base soundscapes with 30 s maskers and collects 25,440 subjective responses to the resulting audio-visual stimuli, organized as a five-fold cross-validation set plus an independent test set. The response protocol follows ISO/TS 12913-2:2018. To demonstrate the dataset's use for benchmarking perceptual models, four baselines are trained (an elastic net, a CNN, and two Probabilistic Perceptual Attribute Predictors, or PPAPs); the best test performance comes from a feature-domain PPAP variant, with a mean squared error of approximately $0.0838$ for ISO Pleasantness. The methodology combines careful stimulus generation (control of the soundscape-to-masker ratio, or SMR, and audio-visual alignment), PCA/SOM-based fold allocation to minimize distributional shifts between folds, and data-quality checks (consistency metrics, reliability analyses). Overall, ARAUS is a scalable, reproducible resource for masker-selection research, model benchmarking, and transfer-learning studies in affective soundscape perception, with implications for real-time soundscape augmentation systems and urban acoustic planning.
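To make the reported metric concrete, the following is a minimal sketch of how an ISO Pleasantness score could be derived from one participant's attribute ratings and how a mean squared error over such scores is computed. It assumes the commonly used formulation from ISO/TS 12913-3 with ratings on a 1 to 5 scale and normalization to the range [-1, 1]; the function names and the dictionary keys are illustrative, not taken from the ARAUS code base.

```python
import math

def iso_pleasantness(ratings, scale_min=1, scale_max=5):
    """Normalized ISO Pleasantness in [-1, 1] from attribute ratings.

    Assumption: ratings is a dict keyed by attribute name, with values on
    a scale_min..scale_max scale (1..5 here), combined as in ISO/TS 12913-3.
    """
    cos45 = math.cos(math.radians(45))
    raw = ((ratings["pleasant"] - ratings["annoying"])
           + cos45 * (ratings["calm"] - ratings["chaotic"])
           + cos45 * (ratings["vibrant"] - ratings["monotonous"]))
    # Largest attainable |raw|; equals 4 + sqrt(32) for a 1..5 scale.
    max_abs = (scale_max - scale_min) * (1 + 2 * cos45)
    return raw / max_abs

def mean_squared_error(predicted, observed):
    """Plain MSE between two equal-length sequences of scores."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

# Hypothetical response: a fairly pleasant, calm augmented soundscape.
example = {"pleasant": 4, "annoying": 2, "calm": 4,
           "chaotic": 2, "vibrant": 3, "monotonous": 3}
score = iso_pleasantness(example)
```

Because the score is bounded to [-1, 1], a mean squared error near 0.08 corresponds to a root-mean-square error of roughly 0.29 on that scale.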

Abstract

Choosing optimal maskers for existing soundscapes to effect a desired perceptual change via soundscape augmentation is non-trivial, owing to the wide variety of available maskers and a dearth of benchmark datasets with which to compare and develop soundscape augmentation models. To address this problem, we make publicly available the ARAUS (Affective Responses to Augmented Urban Soundscapes) dataset, which comprises a five-fold cross-validation set and an independent test set totaling 25,440 unique subjective perceptual responses to augmented soundscapes presented as audio-visual stimuli. Each augmented soundscape is made by digitally adding "maskers" (bird, water, wind, traffic, construction, or silence) to urban soundscape recordings at fixed soundscape-to-masker ratios. Responses were then collected by asking participants to rate how pleasant, annoying, eventful, uneventful, vibrant, monotonous, chaotic, calm, and appropriate each augmented soundscape was, in accordance with ISO/TS 12913-2:2018. Participants also provided relevant demographic information and completed standard psychological questionnaires. We perform exploratory and statistical analyses of the responses obtained to verify internal consistency and agreement with known results in the literature. Finally, we demonstrate the benchmarking capability of the dataset by training and comparing four baseline models for urban soundscape pleasantness: a low-parameter regression model, a high-parameter convolutional neural network, and two attention-based networks in the literature.
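The mixing step described above, adding a masker to a soundscape at a fixed soundscape-to-masker ratio (SMR), can be sketched as follows. This is an illustration under simplifying assumptions rather than the authors' procedure: it measures level as unweighted mean-square power of raw samples instead of calibrated A-weighted levels, it requires both tracks to have the same length, and the function names are hypothetical.

```python
import math

def mean_square(samples):
    """Mean-square power of a sequence of audio samples."""
    return sum(v * v for v in samples) / len(samples)

def masker_gain_for_smr(soundscape, masker, smr_db):
    """Linear gain that places the masker smr_db decibels below the soundscape.

    Assumption: SMR = 10 * log10(P_soundscape / P_masker) with unweighted
    power. A silent masker gets gain 0.0 (nothing to scale).
    """
    masker_power = mean_square(masker)
    if masker_power == 0.0:
        return 0.0
    target_power = mean_square(soundscape) / (10 ** (smr_db / 10))
    return math.sqrt(target_power / masker_power)

def mix_at_smr(soundscape, masker, smr_db):
    """Return soundscape + scaled masker, sample by sample."""
    if len(soundscape) != len(masker):
        raise ValueError("tracks must have the same number of samples")
    gain = masker_gain_for_smr(soundscape, masker, smr_db)
    return [s + gain * m for s, m in zip(soundscape, masker)]

# Toy signals standing in for real recordings.
soundscape = [0.5, -0.5] * 50
masker = [0.1, -0.1, -0.1, 0.1] * 25
augmented = mix_at_smr(soundscape, masker, smr_db=6.0)
```

A positive SMR keeps the masker quieter than the base soundscape, a negative SMR makes it louder, and the "silence" masker leaves the soundscape unchanged.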

Paper Structure (34 sections, 2 equations, 18 figures, 10 tables)

Introduction
Related Datasets
Large-scale Audio Datasets
Affective Sound Datasets
Data Collection Methodology
Base Urban Soundscapes
Maskers
Fold Allocation
Track Calibration
Acoustic and Psychoacoustic Indicator Computation
Dimensionality Reduction
Clustering and Fold Assignment
Generation of Stimuli
Participant Recruitment
Listening Conditions
...and 19 more sections

Figures (18)

Figure 1: Framework of the study methodology.
Figure 2: Illustration of stimulus generation procedure for a single stimulus.
Figure 3: Test sites at (top left) Academic Media Studio, SUTD, (top right) Media Technology Laboratory, NTU, (bottom left) Demo Room, NTU, (bottom right) Interactive Soundscape Room, NTU. Their noise floors, measured as $L_{\text{A,eq,3-min}}$ values with a B&K Sound Level Meter Type 2240, were 20.6 dB, 26.0 dB, 36.9 dB, and 30.2 dB, respectively.
Figure 4: GUI used to administer the ARQ for the ARAUS dataset.
Figure 5: Mean change in ISO Pleasantness value as a function of each of the 287 (280 cross-validation, 7 test set) maskers used to augment soundscapes in the ARAUS dataset, aggregated over soundscapes and SMRs used.
...and 13 more figures
