ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation

Weilong Dong, Xinwei Wu, Renren Jin, Shaoyang Xu, Deyi Xiong

TL;DR

ConTrans presents a low-cost, internal approach to weak-to-strong alignment by harvesting concept vectors from a weakly aligned source LLM, reformulating them to a target model’s latent space via an affine transform, and transplanting them into the target’s residual stream with a steering coefficient. The method is validated through cross-model transfers (e.g., from 7B instruct to 13B and 70B base models) and across model families, achieving improvements in truthfulness and reductions in toxicity without extensive retraining. Key findings include evidence of shared concept features across models, activation of concepts during alignment, and successful transplantation between models of different sizes, though single-concept interventions and cross-domain capability gains remain challenging. Overall, ConTrans demonstrates an efficient pathway for cross-model alignment generalization by cultivating robust concepts in weaker models and transplanting them into stronger ones, with practical implications for scalable alignment engineering.

Abstract

Ensuring that large language models (LLMs) behave consistently with human goals, values, and intentions is crucial for their safety yet computationally expensive. To reduce the computational cost of alignment training of LLMs, especially for those with a huge number of parameters, and to reuse learned value alignment, we propose ConTrans, a novel framework that enables weak-to-strong alignment transfer via concept transplantation. From the perspective of representation engineering, ConTrans first refines concept vectors in value alignment from a source LLM (usually a weak yet aligned LLM). The refined concept vectors are then reformulated to adapt to the target LLM (usually a strong yet unaligned base LLM) via affine transformation. In the third step, ConTrans transplants the reformulated concept vectors into the residual stream of the target LLM. Experiments demonstrate the successful transplantation of a wide range of aligned concepts from 7B models to 13B and 70B models across multiple LLMs and LLM families. Remarkably, ConTrans even surpasses instruction-tuned models in terms of truthfulness. Experimental results validate the effectiveness of both inter-LLM-family and intra-LLM-family concept transplantation. Our work successfully demonstrates an alternative way to achieve weak-to-strong alignment generalization and control.
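
The three steps named in the abstract (refine, reformulate, transplant) can be illustrated with a minimal NumPy sketch on synthetic hidden states. This is not the authors' implementation. It assumes that a concept vector is the difference between the mean hidden states of positive and negative examples, that the affine map is fitted by ordinary least squares on hidden states both models produce for the same inputs, and that transplantation adds the mapped vector, scaled by a steering coefficient `alpha`, to the target's hidden states. All function names are hypothetical.

```python
import numpy as np


def refine_concept(pos_hidden: np.ndarray, neg_hidden: np.ndarray) -> np.ndarray:
    """Step 1 (assumed rule): concept vector = mean(positive) - mean(negative)."""
    return pos_hidden.mean(axis=0) - neg_hidden.mean(axis=0)


def fit_affine(src_hidden: np.ndarray, tgt_hidden: np.ndarray):
    """Step 2a (assumed fit): least-squares solution of tgt ≈ src @ W + b."""
    ones = np.ones((src_hidden.shape[0], 1))
    design = np.hstack([src_hidden, ones])  # extra column of ones learns the offset b
    solution, *_ = np.linalg.lstsq(design, tgt_hidden, rcond=None)
    return solution[:-1], solution[-1]      # W: (d_src, d_tgt), b: (d_tgt,)


def reformulate(v_src: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Step 2b: map the concept vector into the target space.

    The vector is a difference of two points, so the offset b cancels
    and only the linear part W acts on it.
    """
    return v_src @ W


def transplant(tgt_hidden: np.ndarray, v_tgt: np.ndarray, alpha: float) -> np.ndarray:
    """Step 3: add the scaled concept vector to the target's hidden states."""
    return tgt_hidden + alpha * v_tgt


# Synthetic stand-ins for hidden states; real ones would come from the two LLMs.
rng = np.random.default_rng(0)
d_src, d_tgt, n = 8, 12, 200
src_states = rng.normal(size=(n, d_src))
tgt_states = src_states @ rng.normal(size=(d_src, d_tgt)) + rng.normal(size=d_tgt)

W, b = fit_affine(src_states, tgt_states)
pos = rng.normal(size=(50, d_src)) + 0.5   # hypothetical positive-example states
neg = rng.normal(size=(50, d_src)) - 0.5   # hypothetical negative-example states
v_tgt = reformulate(refine_concept(pos, neg), W)
steered = transplant(tgt_states[:4], v_tgt, alpha=2.0)
print(steered.shape)
```

In a real model the addition in `transplant` would be applied to the residual stream at one or more chosen layers during the forward pass; which layers to use and how large `alpha` should be are not specified in this section.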

Paper Structure (32 sections, 6 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Diagram of ConTrans, which consists of three essential modules: ➀ concept refinement, which refines and extracts a vector for a given concept from the source LLM $\mathcal{M}^{\text{src}}$ using a set of concept-related positive/negative examples; ➁ concept reformulation, which reshapes and adapts the refined concept vector to the feature space of the target LLM $\mathcal{M}^{\text{tgt}}$ through an affine transformation; and ➂ concept transplantation, which transplants the reformulated concept vector into the residual stream of the target LLM to control its outputs related to the given concept.
  • Figure 2: (a) Emotion prediction accuracy on negative scenarios for each emotion. The bar denotes Token Acc., while the dashed line depicts Logit Acc. (b) The PCA visualization of LLaMA-13B hidden states intervened by $\bm{v}_{\text{fear}}$ from LLaMA-7B.
  • Figure 3: Concept transplantation to different checkpoints of Amber-7B.
  • Figure 4: Visualization of emotion prediction accuracy improvement ratios and absolute improvements due to concept transplantation between Pythia models.
  • Figure 5: Mean Token Acc. changes with the number of sentences.
  • ...and 3 more figures