Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts
Youcheng Huang, Chen Huang, Duanyu Feng, Wenqiang Lei, Jiancheng Lv
TL;DR
The paper investigates whether concept representations in LLMs, encoded as steering vectors (SVs), transfer across models. It proposes L-Cross Modulation, a linear transformation framework that learns a matrix $\mathbf{T}$ mapping SVs from a source LLM to a target LLM, so that $\bar{\lambda}_W^{m_t} \approx \bar{\lambda}_W^{m_s} \mathbf{T}$; the transformed SV, scaled by a coefficient $\beta$, is then used to modulate the target model. Across eleven concepts and multiple open-source LLMs, the paper reports that cross-model SV transfer is effective, that the learned transformations generalize across concepts, and that SVs from smaller models can meaningfully steer larger ones (weak-to-strong transfer). These results suggest a universal structure in how LLMs encode concepts, with implications for cross-model control and safety without retraining. By revealing linear cross-model correspondences in concept representations, the study lays groundwork for more adaptable and interoperable LLMs.
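To make the mapping $\bar{\lambda}_W^{m_t} \approx \bar{\lambda}_W^{m_s} \mathbf{T}$ concrete, here is a minimal NumPy sketch on synthetic data. It rests on several assumptions that the summary does not confirm: that $\mathbf{T}$ is fitted by ordinary least squares on paired source and target vectors, that SVs are row vectors, and that steering means adding the $\beta$-scaled vector to a hidden state. The names `fit_transform` and `steer` are hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_transform(src, tgt):
    """Least-squares estimate of T such that src @ T ≈ tgt.

    src: (n, d_s) paired vectors from the source model (assumed available).
    tgt: (n, d_t) corresponding vectors from the target model.
    """
    T, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return T  # shape (d_s, d_t)


def steer(hidden, sv_src, T, beta):
    """Add the transformed, beta-scaled source SV to a target hidden state.

    Additive steering is an assumption of this sketch, not a claim about
    the paper's exact modulation rule.
    """
    return hidden + beta * (sv_src @ T)


# Synthetic stand-ins for real activations; d_s < d_t mimics a smaller
# source model steering a larger target model.
rng = np.random.default_rng(0)
d_s, d_t, n = 8, 12, 64
T_true = rng.normal(size=(d_s, d_t))
src = rng.normal(size=(n, d_s))
tgt = src @ T_true + 0.01 * rng.normal(size=(n, d_t))

T = fit_transform(src, tgt)

sv_src = rng.normal(size=d_s)   # steering vector from the source model
sv_tgt_est = sv_src @ T         # its estimated counterpart in the target model
hidden = np.zeros(d_t)          # placeholder target hidden state
steered = steer(hidden, sv_src, T, beta=2.0)
```

In this toy setting the source dimension is deliberately smaller than the target dimension, loosely mirroring the weak-to-strong case; real hidden states and the paper's actual fitting procedure may differ.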
Abstract
Understanding the inner workings of Large Language Models (LLMs) is a critical research frontier. Prior research has shown that a single LLM's concept representations can be captured as steering vectors (SVs), enabling the control of LLM behavior (e.g., towards generating harmful content). Our work takes a novel approach by exploring the intricate relationships between concept representations across different LLMs, drawing an intriguing parallel to Plato's Allegory of the Cave. In particular, we introduce a linear transformation method to bridge these representations and present three key findings: 1) Concept representations across different LLMs can be effectively aligned using simple linear transformations, enabling efficient cross-model transfer and behavioral control via SVs. 2) This linear transformation generalizes across concepts, facilitating alignment and control of SVs representing different concepts across LLMs. 3) A weak-to-strong transferability exists between LLM concept representations, whereby SVs extracted from smaller LLMs can effectively control the behavior of larger LLMs.
