Leveraging Audio Representations for Vibration-Based Crowd Monitoring in Stadiums

Yen Cheng Chang, Jesse Codling, Yiwen Dong, Jiale Zhang, Jiasi Chen, Hae Young Noh, Pei Zhang

TL;DR

Crowd monitoring in large venues faces privacy and data-scarcity challenges when relying solely on cameras, microphones, or vibration sensors. The paper presents ViLA, a cross-modality learning framework that pre-trains a vibration-focused encoder on unlabeled audio data and then fine-tunes with limited vibration data to classify crowd behaviors. Central to ViLA are the Similarity and Diversity indicators, which guide modality selection and predict transfer effectiveness between audio and vibration domains. Real-world evaluations in two stadiums show substantial improvements over non-audio baselines, including up to $5.8\times$ error reduction, highlighting a practical pathway for privacy-preserving, scalable crowd monitoring in large public spaces.

Abstract

Crowd monitoring in sports stadiums is important for enhancing public safety and improving the audience experience. Existing approaches rely mainly on cameras and microphones, which can be significantly disruptive and often raise privacy concerns. We instead sense floor vibration, a less disruptive and less intrusive modality, to predict crowd behavior. However, because vibration-based crowd monitoring is newly developed, a main challenge is the lack of training data: sports stadiums are large public spaces with complex physical activities. We present ViLA (Vibration Leverage Audio), a vibration-based method that reduces the dependency on labeled data by pre-training on unlabeled cross-modality data. ViLA is first pre-trained on audio data in an unsupervised manner and then fine-tuned with a minimal amount of in-domain vibration data. By leveraging publicly available audio datasets, ViLA learns wave behaviors from audio and then adapts the representation to vibration, reducing the reliance on domain-specific vibration data. Our real-world experiments demonstrate that pre-training the vibration model on publicly available audio data (YouTube8M) achieves up to a $5.8\times$ error reduction compared with the same model without audio pre-training.

Paper Structure

This paper contains 20 sections, 2 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: An overview of the football stadium. The complex crowd dynamics and high noise levels make crowd monitoring a challenging task.
  • Figure 2: Examples of audio and vibration spectrograms. The horizontal axis represents time (seconds), and the vertical axis represents frequency. Both spectrograms exhibit many similar patterns.
  • Figure 3: Overview of the ViLA training process. Unsupervised audio pre-training is performed first, followed by supervised vibration fine-tuning that reuses the encoder learned during pre-training.
  • Figure 4: Parameter settings for the audio spectrogram transformation. Audio signals are down-sampled from 16 kHz to 1 kHz and converted from 128-bin to 32-bin Mel-spectrograms to prevent the null values that can arise when a large number of Mel bins is combined with the low vibration sampling rate.
  • Figure 5: Parameter settings for masking. We used a sample rate of 1 kHz and a patch size of 4×4 to avoid null values in the vibration spectrum, which can occur because vibration is sampled at a lower rate than audio.
  • ...and 11 more figures
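
The parameter choices in the Figure 4 and Figure 5 captions can be illustrated with a small sketch. It rests on one plausible reading of "null values": a triangular Mel filter that is narrower than the FFT bin spacing covers no frequency bin, so its output row is always zero. The HTK-style Mel formula, the FFT size of 128, the 75% mask ratio, and the function names `count_empty_mel_filters` and `random_patch_mask` are all assumptions made for illustration; none of them is given in the captions. Only the 1 kHz sample rate, the 128-bin and 32-bin settings, and the 4×4 patch size come from the source.

```python
import math
import random


def hz_to_mel(freq_hz):
    """HTK-style Mel scale (an assumed convention, not stated in the source)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def count_empty_mel_filters(sample_rate, n_fft, n_mels):
    """Count triangular Mel filters whose support contains no FFT bin."""
    max_mel = hz_to_mel(sample_rate / 2.0)
    # n_mels triangular filters need n_mels + 2 edge points, evenly spaced in Mel.
    edges_hz = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    empty = 0
    for m in range(n_mels):
        lower, upper = edges_hz[m], edges_hz[m + 2]
        # A filter's weight is zero at its edges, so only bins strictly inside count.
        if not any(lower < f < upper for f in bin_hz):
            empty += 1
    return empty


def random_patch_mask(n_mels, n_frames, patch=4, mask_ratio=0.75, seed=0):
    """Return the patch grid shape and the set of masked (row, col) patch indices."""
    if n_mels % patch or n_frames % patch:
        raise ValueError("spectrogram dimensions must be divisible by the patch size")
    rows, cols = n_mels // patch, n_frames // patch
    all_patches = [(r, c) for r in range(rows) for c in range(cols)]
    n_masked = int(len(all_patches) * mask_ratio)  # mask_ratio is hypothetical
    rng = random.Random(seed)
    masked = set(rng.sample(all_patches, n_masked))
    return (rows, cols), masked


if __name__ == "__main__":
    N_FFT = 128  # hypothetical FFT size, not given in the source
    for n_mels in (128, 32):
        empty = count_empty_mel_filters(1000, N_FFT, n_mels)
        print(f"{n_mels} Mel bins at 1 kHz: {empty} empty filters")
    grid, masked = random_patch_mask(n_mels=32, n_frames=64)
    print(f"patch grid {grid}, masked {len(masked)} of {grid[0] * grid[1]} patches")
```

Under these assumptions, the 128-bin setting leaves some low-frequency filters without any FFT bin at a 1 kHz sample rate, while the 32-bin setting leaves none. A 32-bin spectrogram also divides evenly into 4×4 patches along the frequency axis.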