Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment

Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

TL;DR

This work addresses cross-scale parametric knowledge transfer for large language models by proposing SemAlign, a semantics-first framework that uses layer activations as the transfer medium. The method consists of three stages: layer attribution and pairing to identify counterpart layers, latent semantic alignment to project teacher semantics into the student space, and cosine-based representation steering to align both intermediate representations and final outputs. Across four benchmarks and multiple LLM pairs, SemAlign consistently outperforms prior PKT baselines and remains closer to larger teachers than alternative transfer methods, with notable gains on MMLU and code-related tasks. The approach reduces neural incompatibility, is computation-efficient, and demonstrates the value of preserving semantic content in latent space for robust cross-scale knowledge transfer.

Abstract

Large Language Models (LLMs) encode vast amounts of knowledge in their massive parameters, where it can be located, traced, and analyzed. Despite advances in neural interpretability, it is still unclear how to transfer this knowledge in a fine-grained manner, a problem known as parametric knowledge transfer (PKT). A key challenge is enabling effective and efficient knowledge transfer across LLMs of different scales, which is essential for making such transfer more flexible and more broadly applicable. Because of neural incompatibility, that is, the architectural and parametric differences between LLMs of varying scales, existing methods that directly reuse layer parameters are severely limited. In this paper, we identify semantic alignment in latent space as the fundamental prerequisite for cross-scale knowledge transfer between LLMs. Instead of directly using the layer parameters, our approach takes activations as the medium of layer-wise knowledge transfer. By leveraging the semantics in latent space, our approach is simple and outperforms prior work, better aligning model behaviors across varying scales. Evaluations on four benchmarks demonstrate the efficacy of our method. Further analysis reveals the key factors that ease cross-scale knowledge transfer and provides insights into the nature of latent semantic alignment.
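
The summary above describes cosine-based representation steering, in which the student's layer outputs are pulled toward a supervisory signal. The following is a minimal sketch of what such an objective could look like, assuming the loss is one minus the cosine similarity between a student representation and its target. The exact objective is not stated in this section, and the function names `cosine_similarity`, `steering_loss`, and `mean_steering_loss` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def steering_loss(student_hidden, target_hidden):
    """Assumed objective: 0 when directions agree, 2 when they are opposite."""
    return 1.0 - cosine_similarity(student_hidden, target_hidden)


def mean_steering_loss(student_seq, target_seq):
    """Average the per-token loss over a sequence of representations."""
    losses = [steering_loss(s, t) for s, t in zip(student_seq, target_seq)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Same direction, different magnitude: the loss ignores scale.
    print(steering_loss([1.0, 0.0], [2.0, 0.0]))  # 0.0
    # Orthogonal directions.
    print(steering_loss([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Because cosine similarity is scale-invariant, an objective of this form constrains only the direction of a representation, which fits the summary's emphasis on preserving semantic content rather than matching raw activation values.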

Paper Structure

This paper contains 27 sections, 11 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Semantic association of vocabulary and latent space. For each color label on the vocabulary (left), there is a color semantic basis in the latent space (middle). The semantics of the dark dot (indicating an arbitrary representation) in the latent space can be quantified as its cosine similarities to semantic bases. The semantics can be computed as probabilities on the vocabulary. When focusing on the nearest semantic basis for a given latent representation, a latent space can be quantified as discrete semantic regions (right).
  • Figure 2: Empirical Validation of Semantics Decomposition on HumanEval with Llama2 (7B).
  • Figure 3: Illustration of our cross-scale knowledge transfer approach. Assume a 20-layer teacher LM and a 10-layer student LM. First, layers in the teacher and student models are paired, as shown by the dashed arrows. Marked in orange, the teacher's 20th layer is identified as critical, and its pair is the student's 10th layer, namely the layer to optimize. Second, represented by the dots in the 3D and 2D spaces, the layer outputs from the teacher model are decomposed in the teacher's higher-dimensional latent space and recomposed in the student's lower-dimensional latent space, serving as the supervisory signal. The signal undergoes dimensionality reduction but still preserves the complete semantics, as represented by the change to the gray bots, which retain the body gesture but lose detail. Third, the paired student layer is updated so that its outputs move close to the supervisory signal. This is analogous to the blue bots being adjusted to strike the same body gesture as the gray bots. After the cross-scale knowledge transfer, the student's layer outputs steer toward the supervisory signal, represented by the dashed curve, and part of the layer parameters are optimized, marked by the delta symbol.
  • Figure 4: Comparison of Layer-wise Representation Similarities between LLMs.
  • Figure 5: Comparison of Layer-wise Representation Similarities between LLMs.
  • ...and 1 more figures
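
The captions of Figures 1 and 3 describe quantifying a latent representation by its cosine similarities to semantic bases, then decomposing teacher outputs in the teacher's space and recomposing them in the student's smaller space. The following toy sketch illustrates that idea under loud assumptions: both models share the same vocabulary labels, each label has one basis vector per model, and recomposition is a similarity-weighted sum of student bases. None of these details are confirmed by this section, and all names and numbers below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def decompose(hidden, bases):
    """Semantics of `hidden` as cosine similarities to each semantic basis."""
    return [cosine_similarity(hidden, basis) for basis in bases]


def recompose(coefficients, bases):
    """Assumed recomposition: similarity-weighted sum of the target-space bases."""
    dim = len(bases[0])
    out = [0.0] * dim
    for coeff, basis in zip(coefficients, bases):
        for i in range(dim):
            out[i] += coeff * basis[i]
    return out


def nearest_basis(hidden, bases):
    """Index of the closest basis, i.e. the discrete semantic region."""
    sims = decompose(hidden, bases)
    return max(range(len(sims)), key=sims.__getitem__)


# Hypothetical toy setup: three vocabulary labels, a 3D teacher space,
# and a 2D student space (matching the 3D-to-2D picture in Figure 3).
teacher_bases = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
student_bases = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

teacher_hidden = [2.0, 0.1, 0.0]
semantics = decompose(teacher_hidden, teacher_bases)
student_target = recompose(semantics, student_bases)

if __name__ == "__main__":
    print(len(teacher_hidden), "->", len(student_target))  # 3 -> 2
    print(nearest_basis(teacher_hidden, teacher_bases))     # 0
    print(nearest_basis(student_target, student_bases))     # 0
```

In this toy case the recomposed 2D target falls in the same semantic region (nearest basis) as the 3D teacher representation, which is the sense in which the dimension is reduced while the semantics are kept.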