ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signals

Yucong Zhang, Juan Liu, Ming Li

TL;DR

ECHO is a novel foundation model that integrates an advanced band-split architecture with frequency positional embeddings, enabling spectral localization across arbitrary sampling configurations. It uses sliding patches to support inputs of variable length without padding or cropping.

Abstract

Pre-trained foundation models have demonstrated remarkable success in audio, vision, and language, yet their potential for general machine signal modeling at arbitrary sampling rates (covering acoustic, vibration, and other industrial sensor data) remains under-explored. In this work, we propose a novel foundation model, ECHO, that integrates an advanced band-split architecture with frequency positional embeddings, enabling spectral localization across arbitrary sampling configurations. Moreover, the model incorporates sliding patches to support inputs of variable length without padding or cropping, producing a concise embedding that retains both temporal and spectral fidelity and naturally extends to streaming scenarios. We evaluate our method on a range of machine signal datasets, including the DCASE Task 2 challenges (2020-2025) and widely used industrial signal corpora. Experimental results demonstrate consistent state-of-the-art performance in machine signal anomaly detection and fault classification, confirming the effectiveness and generalization capability of the proposed model. ECHO is open-sourced at https://github.com/yucongzh/ECHO.

Paper Structure

This paper contains 14 sections, 1 equation, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Feature extraction pipeline of the ECHO framework. F: number of frequency bins after STFT; T: number of time frames after STFT; N: number of sub-bands after band splitting; W: band width of sub-bands; P: number of sliding patches; D: feature dimension for each patch.
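The shape bookkeeping implied by the symbols in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_bands`, `patch_starts`, and `sliding_patches`, the parameters `patch_len` and `hop`, the choice to let the last sub-band be narrower when F is not a multiple of W, and the choice to anchor one extra patch at the end of the signal so no frames are dropped are all assumptions made for illustration. The encoder that would map each patch to a D-dimensional feature is omitted.

```python
# Hypothetical sketch of band splitting and sliding patches on an F x T
# spectrogram stored as a list of F rows with T values each.

def split_bands(spec, band_width):
    """Split F frequency rows into sub-bands of `band_width` (W) rows.

    Assumption: if F is not a multiple of W, the last band is narrower.
    """
    return [spec[i:i + band_width] for i in range(0, len(spec), band_width)]


def patch_starts(num_frames, patch_len, hop):
    """Start indices of sliding patches along the time axis.

    Assumption: when the regular hops do not reach the final frame, one
    extra patch is anchored at the end, so nothing is padded or cropped.
    """
    if num_frames < patch_len:
        raise ValueError("signal shorter than one patch")
    starts = list(range(0, num_frames - patch_len + 1, hop))
    if starts[-1] != num_frames - patch_len:
        starts.append(num_frames - patch_len)
    return starts


def sliding_patches(band, patch_len, hop):
    """Cut one sub-band (W x T) into P patches of shape W x patch_len."""
    num_frames = len(band[0])
    return [
        [row[s:s + patch_len] for row in band]
        for s in patch_starts(num_frames, patch_len, hop)
    ]


# Toy spectrogram with F = 8 frequency bins and T = 10 time frames.
F, T, W = 8, 10, 4
spec = [[float(f * T + t) for t in range(T)] for f in range(F)]

bands = split_bands(spec, W)
patches = [sliding_patches(b, patch_len=4, hop=3) for b in bands]

N = len(bands)       # number of sub-bands
P = len(patches[0])  # number of sliding patches per sub-band
print(N, P)
```

In this sketch P grows with T, so a longer recording yields more patches rather than a padded or truncated input, which is the property the sliding-patch design is described as providing.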