ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation
Weilong Dong, Xinwei Wu, Renren Jin, Shaoyang Xu, Deyi Xiong
TL;DR
ConTrans is a low-cost, internal approach to weak-to-strong alignment. It harvests concept vectors from a weak but aligned source LLM, maps them into the target model’s latent space via an affine transform, and transplants them into the target’s residual stream, scaled by a steering coefficient. The method is validated through cross-model transfers (e.g., from 7B instruct models to 13B and 70B base models) and across model families, improving truthfulness and reducing toxicity without extensive retraining. Key findings include evidence of shared concept features across models, activation of concepts during alignment, and successful transplantation between models of different sizes, though single-concept interventions and cross-domain capability gains remain challenging. Overall, ConTrans demonstrates an efficient pathway for cross-model alignment generalization: cultivate robust concepts in weaker models and transplant them into stronger ones, with practical implications for scalable alignment engineering.
Abstract
Ensuring that large language models (LLMs) behave consistently with human goals, values, and intentions is crucial for their safety, yet computationally expensive. To reduce the computational cost of alignment training, especially for LLMs with a huge number of parameters, and to reuse learned value alignment, we propose ConTrans, a novel framework that enables weak-to-strong alignment transfer via concept transplantation. ConTrans works in three steps. First, from the perspective of representation engineering, it refines concept vectors in value alignment from a source LLM (usually a weak yet aligned LLM). Second, the refined concept vectors are reformulated to adapt to the target LLM (usually a strong yet unaligned base LLM) via an affine transformation. Third, ConTrans transplants the reformulated concept vectors into the residual stream of the target LLM. Experiments demonstrate the successful transplantation of a wide range of aligned concepts from 7B models to 13B and 70B models across multiple LLMs and LLM families. Remarkably, ConTrans even surpasses instruction-tuned models in terms of truthfulness. Experimental results validate the effectiveness of both inter-LLM-family and intra-LLM-family concept transplantation. Our work demonstrates an alternative way to achieve weak-to-strong alignment generalization and control.
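The three steps described in the abstract can be illustrated with a minimal NumPy sketch on synthetic activations. This is not the authors' implementation. It assumes that a concept vector is the difference of mean activations between positive and negative examples, that the affine map is fitted by ordinary least squares on paired source and target activations, and that transplantation is a scaled addition to the target hidden states. All function names, dimensions, and the coefficient `alpha` are hypothetical.

```python
import numpy as np

def concept_vector(pos_acts, neg_acts):
    """Step 1 (assumed form): mean-difference direction in the source model."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def fit_affine(src_acts, tgt_acts):
    """Step 2: least-squares fit of W, b so that src_acts @ W + b ~= tgt_acts."""
    ones = np.ones((src_acts.shape[0], 1))
    design = np.hstack([src_acts, ones])          # append a bias column
    sol, *_ = np.linalg.lstsq(design, tgt_acts, rcond=None)
    return sol[:-1], sol[-1]                      # W: (d_src, d_tgt), b: (d_tgt,)

def transplant(hidden, vec, alpha):
    """Step 3: add the scaled concept vector to every position's hidden state."""
    return hidden + alpha * vec

# Synthetic stand-ins for real model activations (hypothetical sizes).
rng = np.random.default_rng(0)
d_src, d_tgt, n = 8, 16, 200
W_true = rng.normal(size=(d_src, d_tgt))
b_true = rng.normal(size=d_tgt)

# Paired activations of both models on the same inputs (noise-free here).
src_acts = rng.normal(size=(n, d_src))
tgt_acts = src_acts @ W_true + b_true

# Source-model activations on positive and negative concept examples.
pos_acts = rng.normal(loc=1.0, size=(50, d_src))
neg_acts = rng.normal(loc=-1.0, size=(50, d_src))

v_src = concept_vector(pos_acts, neg_acts)
W, b = fit_affine(src_acts, tgt_acts)

# A difference of two mapped means cancels the bias, so only W acts on v_src.
v_tgt = v_src @ W

hidden = rng.normal(size=(5, d_tgt))              # (seq_len, d_tgt) at one layer
steered = transplant(hidden, v_tgt, alpha=2.0)
```

In a real setting the activations would come from a chosen layer of each model rather than from a random generator, and the steering coefficient would be tuned per concept. Applying only `W` to the concept vector is a choice made in this sketch, since the bias term cancels when mapping a difference of means.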
