The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure

Yongzhong Xu

TL;DR

This work extends the geometric analysis of grokking from single-task to multi-task learning by training small shared-trunk Transformers on dual-task and tri-task modular arithmetic across a systematic weight decay sweep. The resulting picture is consistent across settings: grokking unfolds along a low-dimensional execution manifold, tasks generalize in a staggered order (mul → sq → add), trajectories are empirically integrable, and weight decay sets the grokking timescale, curvature depth, and reconstruction threshold. Final generalizing solutions occupy a small trajectory subspace (k* ≈ 4–8) yet are distributed across full-rank weights and are fragile to small transverse perturbations, which supports a holographic incompressibility view. Multi-task learning thus constructs a compact, overlapping superposition of algorithmic directions in parameter space, where weight decay tightens the subspace and excess capacity provides redundant pathways to escape memorization saddles. Together, the findings indicate how overparameterization and weight decay jointly shape generalization through the geometry of the optimization trajectory.

Abstract

Grokking -- the abrupt transition from memorization to generalization long after near-zero training loss -- has been studied mainly in single-task settings. We extend geometric analysis to multi-task modular arithmetic, training shared-trunk Transformers on dual-task (mod-add + mod-mul) and tri-task (mod-add + mod-mul + mod-sq) objectives across a systematic weight decay sweep. Five consistent phenomena emerge. (1) Staggered grokking order: multiplication generalizes first, followed by squaring, then addition, with consistent delays across seeds. (2) Universal integrability: optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold; commutator defects orthogonal to this manifold reliably precede generalization. (3) Weight decay phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary systematically with weight decay, revealing distinct dynamical regimes and a sharp no-decay failure mode. (4) Holographic incompressibility: final solutions occupy only 4--8 principal trajectory directions yet are distributed across full-rank weights and destroyed by minimal perturbations; SVD truncation, magnitude pruning, and uniform scaling all fail to preserve performance. (5) Transverse fragility and redundancy: removing less than 10% of orthogonal gradient components eliminates grokking, yet dual-task models exhibit partial recovery under extreme deletion, suggesting redundant center manifolds enabled by overparameterization. Together, these results support a dynamical picture in which multi-task grokking constructs a compact superposition subspace in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy in optimization pathways.
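The trajectory-subspace quantities mentioned above (the share of variance on the first principal direction, an effective dimension k*, and the removal of gradient components transverse to that subspace) can be illustrated with a minimal NumPy sketch. This is not the paper's code: the function names, the cumulative-variance threshold used to pick k, and the reading of "removing X% of orthogonal gradient components" as scaling down the transverse part of the gradient are all assumptions made for illustration.

```python
import numpy as np


def trajectory_pca(snapshots):
    """PCA of a training trajectory.

    snapshots: (T, D) array, one flattened parameter vector per checkpoint.
    Returns (ratios, directions): explained-variance ratios and the
    principal directions as orthonormal rows of a (min(T, D), D) array.
    """
    X = np.asarray(snapshots, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)  # center across checkpoints
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    var = s ** 2
    return var / var.sum(), vt


def effective_dim(ratios, threshold=0.95):
    """Smallest k whose cumulative explained variance reaches `threshold`.

    The threshold is a hypothetical choice, not a value from the paper.
    """
    k = int(np.searchsorted(np.cumsum(ratios), threshold)) + 1
    return min(k, len(ratios))


def remove_transverse(grad, basis, fraction):
    """Scale down the gradient component orthogonal to span(basis).

    basis: (k, D) orthonormal rows; fraction in [0, 1] is the share of the
    transverse component removed (assumed interpretation of the ablation).
    """
    parallel = basis.T @ (basis @ grad)
    transverse = grad - parallel
    return grad - fraction * transverse


if __name__ == "__main__":
    # Synthetic trajectory: motion in 3 hidden directions plus small noise.
    rng = np.random.default_rng(0)
    T, D, k_true = 50, 200, 3
    true_basis, _ = np.linalg.qr(rng.normal(size=(D, k_true)))
    coords = rng.normal(size=(T, k_true)) * np.array([5.0, 2.0, 1.0])
    snapshots = coords @ true_basis.T + 1e-3 * rng.normal(size=(T, D))

    ratios, directions = trajectory_pca(snapshots)
    print("PC1 share of variance:", ratios[0])
    print("effective dimension:", effective_dim(ratios, threshold=0.999))

    grad = rng.normal(size=D)
    filtered = remove_transverse(grad, directions[:k_true], fraction=0.1)
    print("gradient norm before/after:",
          np.linalg.norm(grad), np.linalg.norm(filtered))
```

In an actual training loop, `snapshots` would be built from saved checkpoints and `remove_transverse` would be applied to the flattened gradient before the optimizer step; both of those wiring details are left out here because the source does not specify them.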

Paper Structure (78 sections, 5 equations, 31 figures, 15 tables)

Figures (31)

  • Figure 1: Multi-task grokking dynamics. (a) Dual-task: multiplication leads addition. (b) Tri-task: a three-way staggered ordering emerges.
  • Figure 2: PC1% decreases with task count. (a) Dual-task: 55--77%. (b) Tri-task: 49--56%. The manifold is no longer rank-1 but remains strongly low-dimensional.
  • Figure 3: PC1% declines over training in multi-task settings, unlike single-task grokking where concentration increases.
  • Figure 4: Grok (WD=1.0) vs. no-WD (WD=0.0) eigenspectra for tri-task (seed 42). No-WD has higher PC1%, consistent with a simpler memorization trajectory.
  • Figure 5: Task-specific head weights are nearly orthogonal. The shared trunk learns a representation where task readouts are geometrically separated.
  • ...and 26 more figures