JiTTER: Jigsaw Temporal Transformer for Event Reconstruction for Self-Supervised Sound Event Detection

Hyeonuk Nam, Yong-Hwa Park

TL;DR

JiTTER tackles the challenge of precise sound event boundary detection by replacing masked block prediction with hierarchical temporal shuffle reconstruction, forcing a transformer to recover the correct sequence at both coarse and fine temporal scales. The method introduces block-level and frame-level shuffles, plus subtle noise, and optimizes a dual objective that combines self-supervised reconstruction with supervised SED fine-tuning. Ablations on the DESED dataset show that jointly leveraging both perturbation levels yields the largest gains, with moderate noise offering regularization benefits. Overall, JiTTER demonstrates that explicit temporal-order reasoning in SSL improves event boundary localization and transient detail learning, offering a practical, scalable pretraining paradigm for SED and related audio tasks, with code available on GitHub.

Abstract

Sound event detection (SED) has significantly benefited from self-supervised learning (SSL) approaches, particularly the masked audio transformer for SED (MAT-SED), which leverages masked block prediction to reconstruct missing audio segments. However, while effective in capturing global dependencies, masked block prediction disrupts transient sound events and lacks explicit enforcement of temporal order, making it less suitable for fine-grained event boundary detection. To address these limitations, we propose JiTTER (Jigsaw Temporal Transformer for Event Reconstruction), an SSL framework designed to enhance temporal modeling in transformer-based SED. JiTTER introduces a hierarchical temporal shuffle reconstruction strategy, where audio sequences are randomly shuffled at both the block level and the frame level, forcing the model to reconstruct the correct temporal order. This pretraining objective encourages the model to learn both global event structures and fine-grained transient details, improving its ability to detect events with sharp onset-offset characteristics. Additionally, we incorporate noise injection during block shuffle, providing a subtle perturbation mechanism that further regularizes feature learning and enhances model robustness. Experimental results on the DESED dataset demonstrate that JiTTER outperforms MAT-SED, achieving a 5.89% improvement in PSDS, highlighting the effectiveness of explicit temporal reasoning in SSL-based SED. Our findings suggest that structured temporal reconstruction tasks, rather than simple masked prediction, offer a more effective pretraining paradigm for sound event representation learning.
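To make the block-shuffle-with-noise idea from the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `block_shuffle_with_noise`, the parameters `block_size`, `shuffle_rate`, and `noise_std`, and the choice of additive Gaussian noise applied only to the shuffled block positions are all hypothetical. The abstract does not specify how the noise is generated or where exactly it is applied.

```python
import numpy as np

def block_shuffle_with_noise(features, block_size, shuffle_rate, noise_std, rng):
    """Shuffle a fraction of non-overlapping time blocks and perturb them with noise.

    features: float array of shape (num_frames, feat_dim).
    Returns (perturbed, order), where order[i] is the index of the original
    block that now sits at block position i.
    All parameter names here are illustrative assumptions.
    """
    num_frames, feat_dim = features.shape
    num_blocks = num_frames // block_size   # trailing partial block is left untouched
    usable = num_blocks * block_size

    # Choose which block positions take part in the shuffle.
    num_selected = int(round(shuffle_rate * num_blocks))
    selected = np.sort(rng.choice(num_blocks, size=num_selected, replace=False))

    # Permute only the selected positions; all other blocks keep their place.
    order = np.arange(num_blocks)
    order[selected] = rng.permutation(selected)

    blocks = features[:usable].reshape(num_blocks, block_size, feat_dim)
    shuffled = blocks[order].copy()

    # Assumption: additive Gaussian noise on the shuffled positions only.
    noise_shape = (num_selected, block_size, feat_dim)
    shuffled[selected] += rng.normal(0.0, noise_std, size=noise_shape)

    perturbed = features.copy()
    perturbed[:usable] = shuffled.reshape(usable, feat_dim)
    return perturbed, order


# Example: 20 frames of 2-dimensional features, blocks of 4 frames.
rng = np.random.default_rng(0)
x = np.arange(40, dtype=float).reshape(20, 2)
perturbed, order = block_shuffle_with_noise(
    x, block_size=4, shuffle_rate=0.6, noise_std=0.05, rng=rng
)
```

In a pretraining setup along these lines, the model would receive `perturbed` as input and the original `features` as the reconstruction target, so that recovering the target requires inferring the correct block order.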

Paper Structure

This paper contains 30 sections, 11 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Illustration of the hierarchical temporal shuffle strategy in JiTTER, designed to improve temporal modeling for self-supervised SED. (a) Block-Level Shuffle: The input audio sequence is divided into non-overlapping blocks, and a portion of these blocks (in darker grey) is randomly shuffled along the time axis. This disrupts global event dependencies while preserving intra-block structures, forcing the model to reconstruct event sequences at a higher level. (b) Frame-Level Shuffle: A subset of blocks is randomly selected, and within each selected block (in blue and orange), a fraction of frames (in darker blue and orange) is randomly shuffled. This introduces fine-grained perturbations while maintaining the overall event order, helping the model learn transient sound characteristics. Together, these two levels of shuffle perturbations encourage the model to reconstruct the correct temporal order, improving both global event structure comprehension and fine-grained boundary detection in SED.