Technical Report of Nomi Team in the Environmental Sound Deepfake Detection Challenge 2026

Candy Olivia Mawalim, Haotian Zhang, Shogo Okada

TL;DR

This work addresses environmental sound deepfake detection (ESDD) on the EnvSDD dataset under two challenging scenarios: unseen generators and low-resource black-box conditions. It introduces a text-guided audio-text cross-attention (ATCA) architecture that fuses audio features with captions generated by audio-captioning models, augmented by BEATs and AASIST backbones, and complements it with a stacked ensemble that leverages RoBERTa-based text features. The ATCA approach shows competitive equal error rate (EER) improvements, particularly in the low-resource Track 2 setting, while the ensemble yields further gains at the cost of increased compute. Overall, the study demonstrates the value of cross-modal semantic guidance and ensembling for robust ESDD with unseen generators and scarce data.
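As a rough illustration of the cross-attention fusion idea mentioned above, the following minimal sketch applies single-head scaled dot-product attention, with audio frames as queries and caption token embeddings as keys and values. It is not the authors' implementation: the direction of attention, the absence of learned projections, the concatenation step, and all names (`cross_attention`, `audio_frames`, `caption_tokens`) are assumptions made for the example. The toy vectors stand in for real BEATs and text-encoder features.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    Each query attends over all keys; the output for a query is the
    attention-weighted sum of the values.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Toy stand-ins (hypothetical): 3 audio frames and 2 caption tokens, dimension 2.
audio_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
caption_tokens = [[1.0, 0.0], [0.0, 1.0]]

# Assumption: audio frames query the caption tokens.
attended, weights = cross_attention(audio_frames, caption_tokens, caption_tokens)

# Assumption: fuse by concatenating each audio frame with its text-attended vector.
fused = [a + t for a, t in zip(audio_frames, attended)]
```

In a real model the queries, keys, and values would pass through learned linear projections, and the fused representation would feed a classifier head. The sketch only shows how text features can reweight what each audio frame sees.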

Abstract

This paper presents our work for the ICASSP 2026 Environmental Sound Deepfake Detection (ESDD) Challenge. The challenge is based on the large-scale EnvSDD dataset, which consists of various synthetic environmental sounds. We focus on addressing the complexities of unseen generators and low-resource black-box scenarios by proposing an audio-text cross-attention model. Experiments with individual and combined text-audio models demonstrate competitive equal error rate (EER) improvements over the challenge baseline (BEATs+AASIST model).
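Since results are reported as EER, a minimal sketch of how an equal error rate can be computed from detection scores may help. It assumes that higher scores mean "more likely fake" and that labels use 1 for fake and 0 for real. The function name `compute_eer` and the threshold sweep are illustrative choices, not the challenge's official scoring script.

```python
def compute_eer(scores, labels):
    """Approximate the equal error rate by sweeping thresholds over the scores.

    Assumptions: label 1 = fake (positive), label 0 = real (negative);
    a sample is flagged as fake when its score >= threshold.
    Returns the mean of the false acceptance and false rejection rates at
    the threshold where the two are closest.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")

    best_gap, best_eer = None, None
    # Include +inf so the "flag nothing" operating point is also considered.
    for t in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= t for s in negatives) / len(negatives)  # real flagged as fake
        frr = sum(s < t for s in positives) / len(positives)   # fake missed
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy example: one fake sample (score 0.3) is ranked below one real sample (0.7).
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_eer = compute_eer(toy_scores, toy_labels)
```

A lower EER indicates better separation between real and synthetic sounds. Production evaluations typically interpolate between thresholds rather than picking the closest discrete point.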

Paper Structure

This paper contains 10 sections, 2 tables.