TeleLoRA: Teleporting Model-Specific Alignment Across LLMs
Xiao Lin, Manoj Acharya, Anirban Roy, Susmit Jha
TL;DR
Trojan mitigation in LLMs is hampered by the need for model-specific alignment data. TeleLoRA is a permutation-symmetric, memory-efficient framework that learns a unified generator of LoRA adapters from local activations across multiple LLMs, enabling zero-shot alignment on unseen models. The approach supports cross-model transfer by sharing adapters across transformer layers, and it uses gradient checkpointing to limit memory use, achieving strong Trojan and jailbreak mitigation with minimal impact on benign performance. Empirical results on TrojAI and jailbreak benchmarks show improved attack resistance and cross-model generalization, which is of practical benefit when deploying aligned LLMs across diverse architectures.
Abstract
Mitigating Trojans in Large Language Models (LLMs) is one of many tasks where alignment data is LLM-specific, as different LLMs have different Trojan triggers and trigger behaviors to be removed. In this paper, we introduce TeleLoRA (Teleporting Low-Rank Adaptation), a novel framework that combines model-specific alignment data across multiple LLMs to enable zero-shot Trojan mitigation on unseen LLMs for which no alignment data is available. TeleLoRA learns a unified generator of LoRA adapter weights by leveraging local activation information across multiple LLMs. This generator is designed to be permutation symmetric so that it generalizes across models with different architectures and sizes. We optimize the model design for memory efficiency, making it feasible to train with large-scale LLMs using limited computational resources. Experiments on LLM Trojan mitigation benchmarks demonstrate that TeleLoRA effectively reduces attack success rates while preserving the benign performance of the models.
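To make the idea of a permutation-symmetric LoRA generator more concrete, the following is a minimal NumPy sketch, not the architecture used in the paper. It assumes a hypothetical generator that maps simple per-neuron activation statistics (mean and standard deviation) to the rows of the low-rank factors through weights shared by all neurons. The function names, the choice of features, and the rank are illustrative assumptions. The sketch shows two properties the abstract relies on: relabeling the neurons of a layer relabels the generated adapter in the same way, and the same generator parameters apply to layers of different widths.

```python
import numpy as np

def make_generator(feat_dim, rank, seed=0):
    # Hypothetical generator parameters, shared by every neuron and every model.
    rng = np.random.default_rng(seed)
    return {
        "w_in": rng.normal(size=(feat_dim, rank)) * 0.1,
        "w_out": rng.normal(size=(feat_dim, rank)) * 0.1,
    }

def neuron_features(acts):
    # acts: (num_samples, num_neurons) -> per-neuron features (num_neurons, 2).
    # Mean and standard deviation are an assumed, illustrative choice.
    return np.stack([acts.mean(axis=0), acts.std(axis=0)], axis=1)

def generate_lora(params, in_acts, out_acts):
    # Each neuron is processed independently with shared weights, so permuting
    # neurons permutes the corresponding columns of A or rows of B.
    A = (neuron_features(in_acts) @ params["w_in"]).T   # (rank, d_in)
    B = neuron_features(out_acts) @ params["w_out"]     # (d_out, rank)
    return A, B

def apply_lora(W, A, B, scale=1.0):
    # Standard LoRA update: W' = W + scale * B A, with W of shape (d_out, d_in).
    return W + scale * (B @ A)

# Demo on a toy layer with 5 inputs and 7 outputs.
rng = np.random.default_rng(1)
params = make_generator(feat_dim=2, rank=4)
in_acts = rng.normal(size=(32, 5))
out_acts = rng.normal(size=(32, 7))
W = rng.normal(size=(7, 5))
A, B = generate_lora(params, in_acts, out_acts)
W_adapted = apply_lora(W, A, B)

# Relabel the neurons and regenerate the adapter from the permuted activations.
p, q = rng.permutation(5), rng.permutation(7)
A_perm, B_perm = generate_lora(params, in_acts[:, p], out_acts[:, q])
W_adapted_perm = apply_lora(W[q][:, p], A_perm, B_perm)

# The adapted weights of the relabeled layer match the relabeled adapted weights.
assert np.allclose(W_adapted_perm, W_adapted[q][:, p])

# The same generator parameters also handle a layer of a different size.
A_big, B_big = generate_lora(params, rng.normal(size=(32, 9)), rng.normal(size=(32, 3)))
```

Because the generator's parameters depend only on the feature dimension and the rank, never on the layer width, one set of parameters can in principle be trained on several models and then applied to an unseen one. This is the property the sketch is meant to illustrate.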
