SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures
Max Hartman, Lav Varshney
TL;DR
SparseJEPA addresses the interpretability bottleneck of dense JEPA embeddings by introducing a sparsity penalty, guided by the oi-VAE framework, that enforces group-based latent structure. The method combines a lightweight Vision Transformer backbone with a sparsity module that promotes latent variable grouping, and it is supported by a theoretical argument that grouping reduces multiinformation among latent variables, established via a data processing inequality for multiinformation. Empirically, SparseJEPA pre-trained on CIFAR-100 yields improved linear-probe transfer across multiple tasks compared to JEPA, while maintaining predictive performance. The work aims at more interpretable and efficient self-supervised representations and points toward object-centric, structured latent spaces for vision systems.
Abstract
Joint Embedding Predictive Architectures (JEPA) have emerged as a powerful framework for learning general-purpose representations. However, these models often lack interpretability and suffer from inefficiencies due to dense embedding representations. We propose SparseJEPA, an extension that integrates sparse representation learning into the JEPA framework to enhance the quality of learned representations. SparseJEPA employs a penalty method that encourages latent space variables to be shared among data features with strong semantic relationships, while maintaining predictive performance. We demonstrate the effectiveness of SparseJEPA by pre-training a lightweight Vision Transformer on the CIFAR-100 dataset. The resulting embeddings are used in linear-probe transfer learning for both image classification and low-level tasks, showcasing the architecture's versatility across different transfer tasks. Furthermore, we provide a theoretical proof that the grouping mechanism enhances representation quality: we show that grouping reduces multiinformation among latent variables, and in doing so we prove the Data Processing Inequality for multiinformation. Our results indicate that incorporating sparsity not only refines the latent space but also facilitates the learning of more meaningful and interpretable representations. In future work, we hope to extend this method by finding new ways to leverage the grouping mechanism through object-centric representation learning.
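To make the penalty idea above concrete, here is a minimal sketch of a group-sparsity term in the spirit of oi-VAE. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact form of the penalty, so the sketch assumes a group-lasso over per-group weight matrices, where each column corresponds to one latent variable. The names `group_sparsity_penalty`, `active_latents`, `lam`, and `tol` are hypothetical.

```python
import numpy as np


def group_sparsity_penalty(group_weights, lam=1.0):
    """Group-lasso style penalty (assumed form, not taken from the paper).

    group_weights: list of 2-D arrays, one per feature group, each of shape
        (group_output_dim, latent_dim). Column j holds the weights that
        connect latent variable j to that group.
    lam: hypothetical penalty strength.

    Returns lam times the sum, over groups and latent variables, of the
    L2 norm of the corresponding weight column.
    """
    total = 0.0
    for w in group_weights:
        # L2 norm of each column: one value per latent variable.
        total += np.linalg.norm(w, axis=0).sum()
    return lam * total


def active_latents(group_weights, tol=1e-6):
    """For each group, list the latent indices whose column norm exceeds tol."""
    return [
        np.flatnonzero(np.linalg.norm(w, axis=0) > tol).tolist()
        for w in group_weights
    ]


# Two groups sharing a 2-dimensional latent space.
example_weights = [
    np.array([[3.0, 0.0], [4.0, 0.0]]),  # group 0 uses only latent 0
    np.array([[0.0, 1.0], [0.0, 0.0]]),  # group 1 uses only latent 1
]
penalty = group_sparsity_penalty(example_weights, lam=0.5)
usage = active_latents(example_weights)
```

Because the norm is taken per column rather than per weight, minimizing this term tends to zero out entire latent-to-group connections, which is what lets each latent variable be read as belonging to a small set of feature groups. In a training loop this term would be added to the JEPA prediction loss; how the two are weighted is not stated in the abstract.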
