Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations

Liu hung ming

Abstract

Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway that generative models provide and creates a structural interpretability gap: the encoder has learned physical structure, but that structure is not accessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound the attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary-free probe that converts continuous V-JEPA 2 latent vectors into discrete symbol sequences without task-specific supervision and without modifying the encoder. Because the encoder is kept completely frozen, any symbolic structure in the AIM codebook is attributable to the pre-trained V-JEPA 2 representations, not to the probe. We evaluate the probe through category-contrast experiments on Kinetics-mini along three physical dimensions: grasp angle, object geometry, and motion temporal structure. AIM symbol distributions differ significantly in all three experiments ($\chi^2$ test, $p < 10^{-4}$; MI 0.036 to 0.117 bits; NMI 1.2% to 3.9% of the 3-bit maximum; JSD up to 0.342; codebook active ratio 62.5%). The experiments indicate that the V-JEPA 2 latent space is markedly compact: diverse action categories share a common representational core, with semantic differences encoded as graded distributional variations rather than categorical boundaries. These results establish Stage 1 of a four-stage roadmap toward an action-conditioned symbolic world model and demonstrate that structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces.
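
The contrast statistics quoted in the abstract can be illustrated with a minimal sketch. It assumes each category is summarized by a vector of symbol counts over the $K = 8$ codebook entries; the function names are hypothetical, the $\chi^2$ test is omitted, and the paper's exact estimators may differ. NMI is taken as MI divided by $\log_2 K = 3$ bits, matching the "3-bit maximum" wording. Both the base-2 Jensen-Shannon divergence and its square root are returned because the abstract does not say which form is reported.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def contrast_stats(counts_a, counts_b, k=8):
    """Compare two categories given their symbol counts per codebook entry.

    counts_a, counts_b: length-k count vectors (hypothetical inputs).
    """
    table = np.array([counts_a, counts_b], dtype=float)
    joint = table / table.sum()        # joint distribution over (category, symbol)
    p_cat = joint.sum(axis=1)          # category marginal
    p_sym = joint.sum(axis=0)          # symbol marginal

    # MI(category; symbol) = H(category) + H(symbol) - H(category, symbol)
    mi = entropy_bits(p_cat) + entropy_bits(p_sym) - entropy_bits(joint.ravel())
    nmi = mi / np.log2(k)              # fraction of the log2(K)-bit maximum

    # Jensen-Shannon divergence (base 2) between the per-category distributions
    pa = table[0] / table[0].sum()
    pb = table[1] / table[1].sum()
    m = 0.5 * (pa + pb)
    jsd_div = entropy_bits(m) - 0.5 * (entropy_bits(pa) + entropy_bits(pb))
    jsd_div = max(jsd_div, 0.0)        # guard against tiny negative round-off

    return {
        "mi_bits": mi,
        "nmi": nmi,
        "jsd_divergence": jsd_div,
        "jsd_distance": float(np.sqrt(jsd_div)),
    }
```

With equal sample counts per category, the base-2 divergence coincides with the category-symbol MI. The reported value pairs are numerically consistent with the square-root form ($\sqrt{0.117} \approx 0.342$, $\sqrt{0.036} \approx 0.19$), but that reading is an inference from the numbers, not something the abstract states.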

Paper Structure (95 sections, 26 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: Stage 1 pipeline overview. Top row (A): The input pipeline feeds 50 Kinetics-mini videos through the frozen V-JEPA 2 ViT-L encoder ($\nabla\theta = 0$), producing $B \times 1568 \times 1024$ latent token vectors that are precomputed once and cached. Middle row (A): The Stage A quantizer trains for 3,000 steps on the precomputed vectors using a linear projection layer ($1024 \to 256$ dimensions, followed by LayerNorm and L2 normalization), an EMA-updated codebook ($K = 8$, $\gamma = 0.90$), a commitment loss ($\beta = 2.0$), and a dead-code reset mechanism; the V-JEPA 2 encoder weights remain fully frozen throughout. Bottom row (B/C): After training, the pipeline is evaluated on two diagnostic criteria: H1 symbol stability (20 repeated forward passes per video; consistency $= 100\%$) and H2 category-contrast experiments (three physical-variable pairs assessed via $\chi^2$, MI, and JSD), with a Gaussian-noise random baseline confirming MI $\approx 0$ for unstructured inputs. All four pass criteria are satisfied, producing the Stage 1 report and the AIM dictionary that certify readiness for Stage 2.
  • Figure 2: Stage A quantizer training curves. Left: commitment loss as a function of training iteration, showing rapid convergence within the first 200 steps. Right: codebook perplexity trajectory; the red dashed line marks the health threshold ($0.4 \times K = 3.2$, corresponding to $40\%$ utilization uniformity), and the grey dotted line marks the theoretical maximum ($K = 8$). The converged perplexity of $4.635$ (linear scale) lies well above the health threshold and below the theoretical maximum, indicating healthy codebook utilization throughout training.
  • Figure 3: Intervention results for grasp_angle (archery vs. bowling). Left: symbol distribution over the top-20 codebook entries; archery concentrates entirely on entry #5 while bowling shows a secondary mass on entry #4. Centre: pairwise JSD heatmap ($\mathrm{JSD} = 0.19$). Right: symbol sensitivity plot; red bars indicate entries included in the condition--symbol mapping.
  • Figure 4: Intervention results for object_geometry (flying_kite vs. high_jump). Left: flying_kite exhibits a small secondary mass on entry #4 absent in high_jump. Centre: pairwise JSD heatmap ($\mathrm{JSD} = 0.19$). Right: symbol sensitivity plot; entries #4 and #5 are the active mapping symbols.
  • Figure 5: Intervention results for motion_speed (marching vs. archery). Left: marching distributes mass across entries #5, #4, and #3, while archery concentrates entirely on #5, reflecting the contrast between periodic gait and quasi-static release dynamics. Centre: pairwise JSD heatmap ($\mathrm{JSD} = 0.34$), the largest of the three interventions. Right: symbol sensitivity plot; entry #5 shows the highest sensitivity, with #4 and #3 also active (red/blue bars).
  • ...and 1 more figure
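
The quantizer described in the Figure 1 and Figure 2 captions can be sketched in NumPy as follows. This is a simplified illustration, not the paper's implementation: the projection weights and input tokens are random placeholders, the LayerNorm has no learned affine parameters, the EMA step is a plain per-entry moving average, and the commitment loss and dead-code reset are omitted. All function names are hypothetical; only $K = 8$, $\gamma = 0.90$, the $1024 \to 256$ projection, the 1568 tokens per video, and the $0.4 \times K$ perplexity threshold come from the captions.

```python
import numpy as np

K, D_IN, D_CODE, GAMMA = 8, 1024, 256, 0.90   # values from the figure captions

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def project(tokens, W, eps=1e-5):
    """Linear 1024 -> 256, LayerNorm (no learned affine), then L2 normalization."""
    z = tokens @ W
    z = (z - z.mean(axis=-1, keepdims=True)) / np.sqrt(z.var(axis=-1, keepdims=True) + eps)
    return l2_normalize(z)

def assign(z, codebook):
    """Index of the nearest codebook entry for each vector (squared Euclidean)."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def ema_update(codebook, z, idx, gamma=GAMMA):
    """Simplified EMA: move each used entry toward the mean of its assigned vectors."""
    new = codebook.copy()
    for k in np.unique(idx):
        new[k] = gamma * codebook[k] + (1.0 - gamma) * z[idx == k].mean(axis=0)
    return new

def perplexity(idx, k=K):
    """exp(entropy) of the empirical symbol distribution; ranges from 1 to k."""
    p = np.bincount(idx, minlength=k) / len(idx)
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1568, D_IN))               # placeholder for cached latents of one video
W = rng.normal(size=(D_IN, D_CODE)) / np.sqrt(D_IN)  # placeholder projection weights
codebook = l2_normalize(rng.normal(size=(K, D_CODE)))

z = project(tokens, W)
idx = assign(z, codebook)                 # discrete symbol sequence, values in [0, K)
codebook = ema_update(codebook, z, idx)   # one EMA step; the encoder is never touched
ppl = perplexity(idx)
healthy = ppl > 0.4 * K                   # health threshold from the Figure 2 caption
```

Because the encoder output is cached and only `W` and `codebook` would change during training, any structure in `idx` has to come from the cached latents, which is the attribution argument the paper makes for a frozen encoder.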