Social-JEPA: Emergent Geometric Isomorphism

Haoran Zhang, Youjin Wang, Yi Duan, Rong Fu, Dianyu Zhao, Sicheng Fan, Shuaishuai Cao, Wentao Guo, Xiao Zhou

TL;DR

The findings reveal that predictive learning objectives impose strong regularities on representation geometry, suggesting a lightweight path to interoperability among decentralized vision systems.

Abstract

World models compress rich sensory streams into compact latent codes that anticipate future observations. We let separate agents acquire such models from distinct viewpoints of the same environment without any parameter sharing or coordination. After training, their internal representations exhibit a striking emergent property: the two latent spaces are related by an approximate linear isometry, enabling transparent translation between them. This geometric consensus survives large viewpoint shifts and scant overlap in raw pixels. Leveraging the learned alignment, a classifier trained on one agent can be ported to the other with no additional gradient steps, while distillation-like migration accelerates later learning and markedly reduces total compute. The findings reveal that predictive learning objectives impose strong regularities on representation geometry, suggesting a lightweight path to interoperability among decentralized vision systems. The code is available at https://anonymous.4open.science/r/Social-JEPA-5C57.

Paper Structure

This paper contains 89 sections, 2 theorems, 21 equations, 9 figures, 10 tables, 2 algorithms.

Key Result

Lemma 4.4

If a pair $(f^\star,p^\star)$ achieves zero JEPA loss (i.e., $p^\star(f^\star(x_c)) = f^\star(x_t)$ almost surely), then for any invertible matrix $A \in GL(d)$, the transformed pair also achieves zero loss.
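
The statement above does not spell out the transformed pair. A natural reading, assumed here rather than taken from the paper, is $f_A = A f^\star$ and $p_A = A \circ p^\star \circ A^{-1}$ (the names $f_A$ and $p_A$ are introduced only for this sketch). Under that assumption the claim follows in a few steps:

$$
\begin{aligned}
p_A\bigl(f_A(x_c)\bigr)
  &= A\,p^\star\bigl(A^{-1} A f^\star(x_c)\bigr) \\
  &= A\,p^\star\bigl(f^\star(x_c)\bigr) \\
  &= A f^\star(x_t) \quad \text{(zero-loss condition, almost surely)} \\
  &= f_A(x_t).
\end{aligned}
$$

Because $A$ ranges over all of $GL(d)$ and need not be orthogonal, the lemma by itself identifies a zero-loss representation only up to an invertible linear map, not up to an isometry.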

Figures (9)

  • Figure 1: The JEPA framework. The model predicts the representation of a target signal from a context signal using a predictor network, with the loss computed in latent space.
  • Figure 2: Comparison of World Model Training Paradigms. Left: MAE/AE relies on a reconstruction loss $\mathcal{L}_{Recon}$ to recover input pixels. Middle: SimCLR/Contrastive uses data augmentations and an InfoNCE loss to learn view-invariant features. Right: Social-JEPA (Ours) allows separate agents to learn world models from disparate observations; they converge to isomorphic latent spaces via the JEPA objective $\mathcal{L}_{JEPA}$ without sharing raw data.
  • Figure 3: Overview of Social-JEPA. Takeaway: independently trained world models exposed to different observation functions can converge to isomorphic latent structures. How to read: a post hoc linear map $W$ serves as a compact translation layer, enabling plug-and-play probe transfer and representation migration without sharing raw observations.
  • Figure 4: Independent agents can learn world models from different observations and align their latent spaces using linear maps ($W_{ij}$), enabling coordination without sharing raw observations.
  • Figure 5: Isomorphism vs. pair budget, with conditioning of $W$.
  • ...and 4 more figures
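
The post hoc linear map $W$ described in Figures 3 and 4 can be illustrated with a minimal sketch. This is not the authors' code: it assumes $W$ is fitted by ordinary least squares on paired latents, uses synthetic data in which one agent's latents are a noisy rotation of the other's, and all function and variable names are hypothetical. It also reports the condition number of $W$, the quantity Figure 5 refers to.

```python
import numpy as np


def fit_linear_map(z_a, z_b):
    """Least-squares W with z_a @ W ~= z_b (rows are paired latents)."""
    w, *_ = np.linalg.lstsq(z_a, z_b, rcond=None)
    return w


def port_linear_probe(w_map, probe_w_b, probe_b_b):
    """Compose agent B's linear probe with W so it accepts agent A's latents."""
    return w_map @ probe_w_b, probe_b_b


rng = np.random.default_rng(0)
d, n_pairs, n_classes = 8, 256, 3

# Synthetic stand-in (an assumption): B's latents are a rotation of A's plus noise.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
z_a = rng.normal(size=(n_pairs, d))
z_b = z_a @ q + 0.01 * rng.normal(size=(n_pairs, d))

w_map = fit_linear_map(z_a, z_b)
cond_w = np.linalg.cond(w_map)  # close to 1 when W is near an isometry
rel_err = np.linalg.norm(z_a @ w_map - z_b) / np.linalg.norm(z_b)

# Hypothetical linear probe that was trained on agent B's latents.
probe_w_b = rng.normal(size=(d, n_classes))
probe_b_b = rng.normal(size=n_classes)

# Port the probe to agent A with no gradient steps.
probe_w_a, probe_b_a = port_linear_probe(w_map, probe_w_b, probe_b_b)

pred_b = np.argmax(z_b @ probe_w_b + probe_b_b, axis=1)
pred_a = np.argmax(z_a @ probe_w_a + probe_b_a, axis=1)
agreement = float(np.mean(pred_a == pred_b))

print(f"cond(W)={cond_w:.3f}  rel_err={rel_err:.4f}  agreement={agreement:.3f}")
```

If an exact isometry is wanted rather than an unconstrained map, an orthogonal Procrustes fit would be the usual substitute. The least-squares version is used here because its conditioning is informative, whereas an orthogonal $W$ always has condition number 1.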

Theorems & Definitions (5)

  • Lemma 4.4: GL$(d)$ symmetry at zero loss
  • proof
  • Proposition 4.5: Near-invariance at small loss
  • proof
  • Remark 4.6: Why this matters
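
The symmetry named in Lemma 4.4 can also be checked numerically on a toy example. The sketch below makes several assumptions that are not from the paper: the loss is a mean squared error in latent space, the predictor is a small fixed nonlinear function, the encoder outputs are replaced by random vectors constructed to give zero loss, and the transformed pair is read as $(A f^\star,\; A \circ p^\star \circ A^{-1})$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 100

# Toy predictor p*(z) = tanh(z P); P is arbitrary (hypothetical).
P = rng.normal(size=(d, d))


def predictor(z):
    return np.tanh(z @ P)


# Stand-ins for encoder outputs: z_c plays f*(x_c), z_t plays f*(x_t).
# Zero loss holds by construction because z_t is defined as p*(z_c).
z_c = rng.normal(size=(n, d))
z_t = predictor(z_c)

# An invertible, non-orthogonal A: orthogonal factor times a positive diagonal.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
A = q @ np.diag(np.linspace(0.5, 2.0, d))
A_inv = np.linalg.inv(A)


def transformed_predictor(z):
    # p_A(z) = A p*(A^{-1} z), written for batches of row vectors.
    return predictor(z @ A_inv.T) @ A.T


# Transformed encoder outputs A f*(x), again as row vectors.
z_c_new = z_c @ A.T
z_t_new = z_t @ A.T

loss_original = float(np.mean((predictor(z_c) - z_t) ** 2))
loss_transformed = float(np.mean((transformed_predictor(z_c_new) - z_t_new) ** 2))

print(loss_original, loss_transformed)
```

The transformed loss is zero up to floating-point rounding even though this $A$ rescales latent directions by factors between 0.5 and 2, so it is far from an isometry.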