Rethinking Parameter Sharing as Graph Coloring for Structured Compression

Boyang Zhang, Daning Cheng, Yunquan Zhang

TL;DR

The paper addresses the memory bottleneck of large neural models and the limitations of heuristic, adjacent-layer parameter sharing. It introduces Geo-Sharing, a symmetry- and graph-coloring-based framework that represents cross-layer sharing via a coloring function $\alpha: L \rightarrow C$ and selects sharing groups using a second-order geometric criterion that aligns perturbations with the Hessian's low-curvature subspace. By decomposing weights with shared bases $B_b=(U_b,V_b)$ and solving a curvature-aligned optimization under a trust-region constraint, the method yields scalable, training-free sharing configurations. Across vision and language benchmarks, Geo-Sharing achieves superior compression–accuracy trade-offs, with substantial inference-time efficiency gains and strong scalability to very large models.
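To make the coloring view concrete, here is a minimal Python sketch of a coloring function $\alpha: L \rightarrow C$ stored as a list, where `alpha[l]` is the sharing class of layer `l`. The 8-layer assignment and the helper names `sharing_groups` and `is_contiguous` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import defaultdict


def sharing_groups(alpha):
    """Group layer indices by their color (sharing class)."""
    groups = defaultdict(list)
    for layer, color in enumerate(alpha):
        groups[color].append(layer)
    return dict(groups)


def is_contiguous(layers):
    """True if the sorted layer list is an unbroken run of indices."""
    return layers == list(range(layers[0], layers[-1] + 1))


# Hypothetical coloring of 8 layers into 3 sharing classes (made-up values).
alpha = [0, 1, 0, 2, 1, 0, 2, 1]

groups = sharing_groups(alpha)
for color, layers in sorted(groups.items()):
    print(f"class {color}: layers {layers}")

# One shared basis per class, so the number of bases equals the number of colors.
num_bases = len(groups)

# Adjacent-layer sharing is the special case where every class is a contiguous run.
adjacent_only = all(is_contiguous(g) for g in groups.values())
print(num_bases, adjacent_only)
```

In this toy assignment, layers that are far apart share a class, which is exactly what an adjacent-only heuristic cannot express.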

Abstract

Modern deep models have massive parameter counts, leading to high inference-time memory usage that limits practical deployment. Parameter sharing, a form of structured compression, effectively reduces redundancy, but existing approaches remain heuristic, are restricted to adjacent layers, and lack a systematic analysis of cross-layer sharing. Extending sharing across multiple layers, however, leads to an exponentially expanding configuration space, making exhaustive search computationally infeasible and forming a critical bottleneck for parameter sharing. We recast parameter sharing from a group-theoretic perspective as introducing structural symmetries in the model's parameter space. A sharing configuration can be described by a coloring function $\alpha: L \rightarrow C$ (where $L$ is the set of layer indices and $C$ the set of sharing classes), which determines inter-layer sharing groups while preserving structural symmetry. To determine the coloring function, we propose a second-order geometric criterion based on Taylor expansion and the Hessian spectrum. By projecting perturbations onto the Hessian's low-curvature eigensubspace, the criterion provides an analytic rule for selecting sharing groups that minimize performance impact, yielding a principled and scalable configuration procedure. Across diverse architectures and tasks, Geo-Sharing consistently outperforms state-of-the-art heuristic sharing strategies, achieving higher compression ratios with smaller accuracy degradation.
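The size of the configuration space can be illustrated with a short Python sketch. It assumes that a sharing configuration is counted as a partition of the layer set into unlabeled groups, so that the count for $n$ layers is the Bell number $B_n$; this counting convention is an assumption made here, and the paper may count configurations differently. The function name `bell_numbers` is illustrative.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max] computed with the Bell triangle."""
    bells = [1]  # B_0 = 1
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


bells = bell_numbers(32)
for n in (4, 8, 16, 32):
    print(f"{n} layers: {bells[n]} possible partitions into sharing groups")
```

Under this assumption the count grows faster than exponentially with depth, which is why an analytic selection rule is attractive compared with enumerating candidates.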

Paper Structure

This paper contains 14 sections, 17 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Existing methods are heuristic and limited to sharing between two adjacent layers, while our method is theoretically guided, achieves automatic cross-layer sharing, and uses fewer bases.
  • Figure 2: Geo-Sharing: The original hierarchical parameter structure is remodeled as a graph, and cross-layer isotropic relationships are achieved through graph coloring. The coloring function $\alpha_{\text{layer}}$ is based on second-order geometric derivation, minimizing the loss growth on the shared error principal axis in the low curvature direction of the Hessian. The right-hand figure shows the alignment effect between the target loss terrain and the coloring rules in the Hessian analysis.
  • Figure 3: Ablations. (a) As the number of minor axes increases, perplexity consistently decreases, though computational burden increases. (b) When the amplitude factor increases, excessive perturbation leads to a sharp surge in perplexity.
  • Figure 4: (a) Comparison of the number of bases in our method (32 layers represented by only 12 bases) with existing methods [wang2024basis]. (b) Specific coloring scheme $\alpha_{\text{layer}}$ of our method when compressing LLaMA 7B by 50% (same color indicates a shared basis).
  • Figure 5: Comparison of the L2 error caused by different coloring functions (top: adjacent, bottom: Geo-Sharing), showing the difference between the shared weights in layers 7-10 and the weights of the original model.
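The curvature argument in the Figure 2 caption can be read against the standard second-order Taylor expansion of the loss. The block below is a sketch in generic notation, not the paper's own equations: $\delta$ (the weight perturbation introduced by sharing), $g$ (the gradient), $H$ (the Hessian), and its eigenpairs $(\lambda_i, v_i)$ are symbols assumed here for illustration.

```latex
% Standard second-order expansion of the loss around the current weights \theta
\Delta\mathcal{L}(\delta)
  = \mathcal{L}(\theta + \delta) - \mathcal{L}(\theta)
  \approx g^{\top}\delta + \tfrac{1}{2}\,\delta^{\top} H\,\delta

% With the eigendecomposition H = \sum_i \lambda_i v_i v_i^{\top},
% the quadratic term splits along the eigen-directions:
\delta^{\top} H\,\delta = \sum_i \lambda_i \,\bigl(v_i^{\top}\delta\bigr)^{2}
```

Read this way, a perturbation whose components $v_i^{\top}\delta$ are concentrated on eigen-directions with small $\lambda_i$ contributes little to the quadratic term, which is the intuition behind aligning the sharing error with the low-curvature subspace.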