Soft Clustering Anchors for Self-Supervised Speech Representation Learning in Joint Embedding Prediction Architectures

Georgios Ioannides, Adrian Kieback, Judah Goldfeder, Linsey Pang, Aman Chadha, Aaron Elkins, Yann LeCun, Ravid Shwartz-Ziv

TL;DR

This paper tackles representation collapse in Joint Embedding Predictive Architectures (JEPA) for self-supervised speech by introducing GMM-Anchored JEPA. A Gaussian Mixture Model is fitted once on log-mel features, and its soft posteriors serve as frozen targets during training; the weight on this cluster supervision decays so that it gradually yields to the JEPA objective. Compared with a WavLM-style baseline at matched compute, the method improves ASR (28.68% vs. 33.22% WER), emotion recognition (67.76% vs. 65.46%), and slot filling (64.7% vs. 59.1% F1), and it reaches near-maximal cluster entropy, indicating more uniform cluster utilization. An ablation shows that removing the residual anchoring term causes representations to deteriorate, which suggests that soft, frozen acoustic anchors provide robust grounding for JEPA-based speech representations. Overall, the work indicates that simple, one-time soft clustering can stabilize self-supervised speech learning and reduce reliance on expensive iterative re-clustering pipelines.

Abstract

Joint Embedding Predictive Architectures (JEPA) offer a promising approach to self-supervised speech representation learning, but suffer from representation collapse without explicit grounding. We propose GMM-Anchored JEPA, which fits a Gaussian Mixture Model once on log-mel spectrograms and uses its frozen soft posteriors as auxiliary targets throughout training. A decaying supervision schedule allows GMM regularization to dominate early training before gradually yielding to the JEPA objective. Unlike HuBERT and WavLM, which require iterative re-clustering, our approach clusters input features once with soft rather than hard assignments. On ~50k hours of speech, GMM anchoring improves ASR (28.68% vs. 33.22% WER), emotion recognition (67.76% vs. 65.46%), and slot filling (64.7% vs. 59.1% F1) compared to a WavLM-style baseline with matched compute. Cluster analysis shows GMM-anchored representations achieve up to 98% entropy compared to 31% for WavLM-style, indicating substantially more uniform cluster utilization. Code is made available at https://github.com/gioannides/clustering-anchored-jepa.
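
The anchoring idea in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation (see the linked repository for that). The following are assumptions made for illustration: a diagonal-covariance GMM, a linear decay for $\lambda(t)$ (the source gives only the endpoints 1.0 and 0.01), a soft cross-entropy as the cluster loss, and an additive combination of the two losses. All function and parameter names are hypothetical.

```python
import math


def gmm_posteriors(frame, weights, means, variances):
    """Soft posterior p(k | frame) under a GMM with fixed (frozen) parameters.

    Assumption: diagonal covariances. `frame` is one feature vector,
    `means` and `variances` are lists of per-component vectors.
    """
    log_joint = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, m, v in zip(frame, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        log_joint.append(ll)
    # Log-sum-exp normalisation for numerical stability.
    peak = max(log_joint)
    unnorm = [math.exp(l - peak) for l in log_joint]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def lambda_schedule(step, total_steps, lam_start=1.0, lam_end=0.01):
    """Weight on the cluster loss at a given step.

    Assumption: linear interpolation between the two endpoints.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_start + (lam_end - lam_start) * frac


def soft_cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))


def total_loss(jepa_loss, cluster_loss, step, total_steps):
    """Assumed combination: JEPA loss plus the decaying cluster term."""
    return jepa_loss + lambda_schedule(step, total_steps) * cluster_loss
```

Under these assumptions, the posteriors are computed once per frame from the frozen GMM and reused as targets, while `lambda_schedule` shifts the emphasis from the cluster term toward the JEPA term. Setting `lam_end=0.0` corresponds to removing the residual anchoring entirely.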

Paper Structure (63 sections, 23 equations, 5 figures, 12 tables, 2 algorithms)

Figures (5)

  • Figure 1: GMM-Anchored JEPA: one-time clustering replaces iterative re-training. Phase 1: A GMM is fitted once on log-mel features. Phase 2: The encoder trains with two objectives: predicting masked EMA teacher latents (JEPA loss) and matching frozen GMM posteriors (cluster loss). $\lambda(t)$ decays from 1.0 to 0.01.
  • Figure 2: GMM-JEPA learns well-separated clusters; baselines collapse or overlap. UMAP of frame-level embeddings colored by predicted cluster. (a) Pure JEPA collapses. (b) WavLM-style overlaps. (c,d) GMM-JEPA variants form distinct regions.
  • Figure 3: GMM-JEPA uses all clusters uniformly; baselines collapse to few. Frame counts per cluster (log scale, sorted by rank). Flat = high entropy. Baselines drop steeply; GMM-JEPA stays flat.
  • Figure 4: GMM-JEPA variants maintain stable, high-confidence clusters over time. Top: cluster ID per frame. Bottom: confidence (1 $-$ normalized entropy). (a) Pure JEPA flickers rapidly with near-zero confidence, indicating degenerate representations. (b) WavLM-style shows moderate confidence (0.4--0.6) with frequent cluster switching. (c) GMM-JEPA-T maintains high confidence (0.7--0.9) with a moderate number of clusters and sparse transitions. (d) GMM-JEPA also uses a moderate number of clusters, with temporally coherent spans and moderate-to-high confidence (0.5--0.8).
  • Figure 5: Removing residual GMM supervision ($\lambda_{\text{end}}=0$) causes collapse. Without ongoing anchoring, representations deteriorate. (a) UMAP shows overlapping clusters. (b) Only 506/1024 clusters used. (c) Confidence drops, flickering increases.
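
The quantities plotted in Figures 3 to 5 can be sketched as follows. This is a minimal illustration, assuming that "normalized entropy" means Shannon entropy divided by $\log K$ for $K$ clusters; the paper's exact evaluation code may differ, and the function names are hypothetical.

```python
import math


def normalized_entropy(values):
    """Shannon entropy of `values` (probabilities or raw counts),
    divided by log(K) so the result lies in [0, 1]."""
    k = len(values)
    if k < 2:
        raise ValueError("need at least two clusters")
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    h = 0.0
    for v in values:
        q = v / total
        if q > 0:
            h -= q * math.log(q)
    return h / math.log(k)


def confidence(posterior):
    """Per-frame confidence as in the Figure 4 caption:
    1 minus the normalized entropy of the cluster posterior."""
    return 1.0 - normalized_entropy(posterior)


def active_clusters(frame_counts):
    """Number of clusters that receive at least one frame."""
    return sum(1 for c in frame_counts if c > 0)
```

Applied to per-cluster frame counts, `normalized_entropy` gives a utilization score where a flat histogram scores near 1 and a histogram dominated by a few clusters scores near 0. Applied to a single frame's posterior, `confidence` is high when the frame is assigned sharply to one cluster.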