JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention
Georgios Ioannides, Christos Constantinou, Aman Chadha, Aaron Elkins, Linsey Pang, Ravid Shwartz-Ziv, Yann LeCun
TL;DR
The paper tackles the challenge of learning robust, language-model-friendly speech representations without labeled data by proposing a two-stage framework that decouples representation learning from waveform reconstruction. Stage 1 combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) to perform adaptive temporal feature selection and capture hierarchical speech structure at a frame rate of 2.5 Hz. In Stage 2, the latent representations are discretized with Finite Scalar Quantization (FSQ) and packed using a mixed-radix scheme to produce 47.5 tokens per second, which can be decoded back to a waveform by a HiFi-GAN decoder. The work reports faster convergence and competitive efficiency relative to neural codecs, providing a reversible, compact tokenization suitable for training downstream language models and other sequence models on speech data.
Abstract
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
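To make the Stage 2 tokenization step concrete, the following minimal sketch illustrates the general idea of FSQ followed by mixed-radix packing. It is not the authors' implementation: the level counts in `LEVELS`, the tanh bounding, and the function names are assumptions chosen for illustration, since the abstract does not specify them. The stated rates do imply 47.5 / 2.5 = 19 tokens per 2.5 Hz frame, but how latent dimensions are grouped into those tokens is not described here.

```python
import math
from typing import List, Sequence

# Hypothetical number of quantization levels per latent dimension
# (an assumption for illustration; the source does not give these values).
LEVELS = (8, 5, 5, 5)


def fsq_quantize(z: Sequence[float], levels: Sequence[int] = LEVELS) -> List[int]:
    """Bound each value with tanh, then round it to one of levels[i] integer codes."""
    # tanh(x) lies in [-1, 1]; rescale to [0, n - 1] before rounding.
    return [round((math.tanh(x) + 1.0) / 2.0 * (n - 1)) for x, n in zip(z, levels)]


def pack_mixed_radix(digits: Sequence[int], levels: Sequence[int] = LEVELS) -> int:
    """Pack per-dimension codes into one integer token (first digit most significant)."""
    token = 0
    for d, n in zip(digits, levels):
        if not 0 <= d < n:
            raise ValueError(f"digit {d} out of range for base {n}")
        token = token * n + d
    return token


def unpack_mixed_radix(token: int, levels: Sequence[int] = LEVELS) -> List[int]:
    """Invert pack_mixed_radix, recovering the per-dimension codes."""
    digits = []
    for n in reversed(levels):
        token, d = divmod(token, n)
        digits.append(d)
    return digits[::-1]


codes = fsq_quantize([10.0, -10.0, 0.0, -10.0])
token = pack_mixed_radix(codes)
recovered = unpack_mixed_radix(token)
```

Because each digit is bounded by its own base, the packing is lossless: with these hypothetical levels, every code vector maps to a unique integer in the range 0 to 8·5·5·5 − 1 = 999, and unpacking recovers the codes exactly. This is the sense in which such a tokenization is reversible at the index level; reconstruction of the waveform itself still depends on the decoder.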
