BRIDLE: Generalized Self-supervised Learning with Quantization
Hoang M. Nguyen, Satya N. Shukla, Qiang Zhang, Hanchao Yu, Sreya D. Roy, Taipeng Tian, Lingjiong Zhu, Yuchen Liu
TL;DR
BRIDLE addresses the limitations of single-codebook quantization in bidirectional self-supervised learning by combining residual quantization with an interleaved encoder/tokenizer pretraining loop across audio, image, and video. Using a hierarchy of codebooks, BRIDLE discretizes the latent space more finely and improves downstream representations, attaining state-of-the-art results on AudioSet and competitive performance on ImageNet-1K and Kinetics-400. The approach introduces four components (E, T, D, TE) and two loss terms, L_cb and L_cos, together with EMA-based codebook updates and initialization strategies that increase code usage. Overall, BRIDLE advances cross-modal SSL by improving representation quality, codebook utilization, and generalization to downstream tasks, and it points to joint training and adaptive quantization as future directions.
Abstract
Self-supervised learning has become a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face two challenges. First, they depend on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. Second, they use the codebook inefficiently, leaving many code vectors underutilized. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process and generalizes to pretraining on audio, image, and video. By using multiple hierarchical codebooks, RQ enables fine-grained discretization of the latent space, enhancing representation quality. BRIDLE trains the encoder and the tokenizer in an interleaved procedure. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, with consistent improvements in downstream performance over traditional VQ methods.
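To make the idea of residual quantization with multiple hierarchical codebooks concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the BRIDLE implementation: the function name `residual_quantize`, the toy codebooks, and the nearest-neighbor lookup by squared Euclidean distance are all hypothetical choices, and the abstract does not specify codebook sizes, the distance measure, or how the codebooks are trained.

```python
import numpy as np


def residual_quantize(x, codebooks):
    """Quantize each row of x with a sequence of codebooks (illustrative only).

    x:         (n, d) array of latent vectors.
    codebooks: list of (k, d) arrays, one per quantization level.
    Returns (codes, x_hat), where codes is an (n, levels) integer array of
    selected indices and x_hat is the (n, d) sum of the selected code vectors.
    """
    residual = np.asarray(x, dtype=np.float64).copy()
    x_hat = np.zeros_like(residual)
    codes = []
    for codebook in codebooks:
        # Squared Euclidean distance from every residual to every code vector
        # (assumed metric; the source does not state which one is used).
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        chosen = codebook[idx]
        x_hat += chosen       # accumulate the reconstruction
        residual -= chosen    # the next level quantizes what is left over
        codes.append(idx)
    return np.stack(codes, axis=1), x_hat


# Toy usage: two levels, two code vectors each (hypothetical values).
latents = np.array([[1.0, 2.0]])
toy_codebooks = [
    np.array([[1.0, 1.0], [0.0, 0.0]]),  # level 1: coarse approximation
    np.array([[0.0, 1.0], [0.0, 0.0]]),  # level 2: refines the residual
]
toy_codes, toy_recon = residual_quantize(latents, toy_codebooks)
print(toy_codes)   # one index per level for each latent vector
print(toy_recon)   # sum of the selected code vectors
```

In this toy case the first codebook captures a coarse approximation of the latent and the second codebook quantizes the remaining residual, so each vector is described by one index per level rather than a single index from one codebook.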
