Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts

Youcheng Huang, Chen Huang, Duanyu Feng, Wenqiang Lei, Jiancheng Lv

TL;DR

The paper investigates whether concept representations in LLMs, encoded as steering vectors (SVs), are transferable across models. It proposes L-Cross Modulation, a linear transformation framework that learns a matrix $\mathbf{T}$ mapping SVs from a source LLM $m_s$ to a target LLM $m_t$, so that $\bar{\lambda}_W^{m_t} \approx \bar{\lambda}_W^{m_s} \mathbf{T}$; the transformed SV, scaled by a coefficient $\beta$, is then used to modulate the target model. Across eleven concepts and multiple open-source LLMs, the paper shows that cross-model SV transfer is effective, that the learned transformations generalize across concepts, and that SVs from smaller models can meaningfully steer larger models (weak-to-strong transfer). These results suggest a universal structure in how LLMs encode concepts, with implications for cross-model control and safety without retraining. By revealing linear cross-model correspondences in concept representations, the study lays a foundation for more adaptable and interoperable LLMs.
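The relation $\bar{\lambda}_W^{m_t} \approx \bar{\lambda}_W^{m_s} \mathbf{T}$ can be illustrated with a small NumPy sketch. This summary does not say how the paper estimates $\mathbf{T}$, so the ordinary least-squares fit below is an assumption. The function names, dimensions, and synthetic data are also hypothetical and stand in for paired representations collected from a real source and target model.

```python
import numpy as np


def fit_transfer_matrix(src_reps, tgt_reps):
    """Fit T so that src_reps @ T approximates tgt_reps.

    Assumption: ordinary least squares; the paper's estimator may differ.
    src_reps has shape (n_pairs, d_src), tgt_reps has shape (n_pairs, d_tgt).
    """
    T, *_ = np.linalg.lstsq(src_reps, tgt_reps, rcond=None)
    return T


def transfer_steering_vector(sv_src, T, beta=1.0):
    """Map a source-model SV into the target space and scale it by beta."""
    return beta * (sv_src @ T)


# Synthetic stand-in for paired representations from two models.
# The source space is smaller than the target space (weak-to-strong setting).
rng = np.random.default_rng(0)
d_src, d_tgt, n_pairs = 8, 16, 200
T_true = rng.normal(size=(d_src, d_tgt))
src_reps = rng.normal(size=(n_pairs, d_src))
tgt_reps = src_reps @ T_true + 0.01 * rng.normal(size=(n_pairs, d_tgt))

T_hat = fit_transfer_matrix(src_reps, tgt_reps)

# Transfer one source-model steering vector into the target model's space.
sv_src = rng.normal(size=d_src)
sv_tgt = transfer_steering_vector(sv_src, T_hat, beta=2.0)
print(T_hat.shape, sv_tgt.shape)
```

Because $\mathbf{T}$ has shape `(d_src, d_tgt)`, the same product works when the two models have different hidden sizes, which is what a small-to-large transfer requires.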

Abstract

Understanding the inner workings of Large Language Models (LLMs) is a critical research frontier. Prior research has shown that a single LLM's concept representations can be captured as steering vectors (SVs), enabling the control of LLM behavior (e.g., towards generating harmful content). Our work takes a novel approach by exploring the intricate relationships between concept representations across different LLMs, drawing an intriguing parallel to Plato's Allegory of the Cave. In particular, we introduce a linear transformation method to bridge these representations and present three key findings: 1) Concept representations across different LLMs can be effectively aligned using simple linear transformations, enabling efficient cross-model transfer and behavioral control via SVs. 2) This linear transformation generalizes across concepts, facilitating alignment and control of SVs representing different concepts across LLMs. 3) A weak-to-strong transferability exists between LLM concept representations, whereby SVs extracted from smaller LLMs can effectively control the behavior of larger LLMs.
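As a rough illustration of what a steering vector is, the sketch below uses the difference of mean hidden states between inputs that do and do not express a concept, then adds a $\beta$-scaled copy of that vector to a hidden state. The abstract does not state how the paper extracts its SVs, so the mean-difference recipe is an assumption. The function names and synthetic hidden states are hypothetical.

```python
import numpy as np


def mean_difference_sv(pos_hidden, neg_hidden):
    """Steering vector as the difference of class means.

    Assumption: a mean-difference recipe; the paper's extraction may differ.
    Both inputs have shape (n_examples, d_model).
    """
    return pos_hidden.mean(axis=0) - neg_hidden.mean(axis=0)


def steer(hidden, sv, beta):
    """Shift a hidden state along the steering vector with strength beta."""
    return hidden + beta * sv


# Synthetic hidden states: the concept shifts activations along axis 0.
rng = np.random.default_rng(1)
d_model = 6
concept_dir = np.zeros(d_model)
concept_dir[0] = 1.0
neg_hidden = rng.normal(size=(500, d_model))
pos_hidden = rng.normal(size=(500, d_model)) + 3.0 * concept_dir

sv = mean_difference_sv(pos_hidden, neg_hidden)

# Larger beta pushes the hidden state further toward the concept.
hidden = np.zeros(d_model)
steered = steer(hidden, sv, beta=0.5)
cosine = sv @ concept_dir / np.linalg.norm(sv)
print(np.round(cosine, 3))
```

In a real model the hidden states would come from a chosen layer's activations, and `steer` would be applied during the forward pass; both choices are left out here.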

Paper Structure (15 sections, 2 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: In Plato’s Allegory of the Cave, prisoners try to comprehend universal reality through their experiences (shadows of reality). By analogy, different LLMs attempt to infer universal concepts by training on their own data. If their conceptual representations capture the same underlying universal concepts, are those representations transferable across different LLMs?
  • Figure 2: L-Cross Modulation uses linear transformations to map the conceptual representations of one LLM onto those of another, which enables SVs derived from one LLM to modulate another LLM's output.
  • Figure 3: t-SNE visualization of representations $\{\lambda_\delta\}$. The green, purple, and yellow dots correspond to the concepts of AIC., CORR., and HALLU., respectively.
  • Figure 4: In Self-Modulation, varying $\beta$ yields at most 54.0% harmful outputs from Qwen2 0.5B. However, the harmful SV derived from Qwen2 0.5B effectively modulates Qwen2 7B to generate 88.0% harmful outputs.
  • Figure 5: Weak-to-Strong L-Cross Modulation, where SVs are extracted from a weak model, Qwen2-0.5B.
  • ...and 7 more figures