TeleLoRA: Teleporting Model-Specific Alignment Across LLMs

Xiao Lin, Manoj Acharya, Anirban Roy, Susmit Jha

TL;DR

Trojan mitigation in LLMs is hampered by model-specific alignment data requirements. TeleLoRA presents a permutation-symmetric, memory-efficient framework that learns a unified generator of LoRA adapters from local activations across multiple LLMs, enabling zero-shot alignment on unseen models. The approach supports cross-model transfer by sharing adapters across transformer layers and leveraging gradient checkpointing, achieving strong Trojan and jailbreak mitigation with minimal impact on benign performance. Empirical results on TrojAI and jailbreak benchmarks demonstrate improved attack resistance and cross-model generalization, highlighting practical benefits for deploying aligned LLMs across diverse architectures.

Abstract

Mitigating Trojans in Large Language Models (LLMs) is one of many tasks where alignment data is LLM-specific, because different LLMs have different Trojan triggers and trigger behaviors to be removed. In this paper, we introduce TeleLoRA (Teleporting Low-Rank Adaptation), a novel framework that synergizes model-specific alignment data across multiple LLMs to enable zero-shot Trojan mitigation on unseen LLMs that have no alignment data. TeleLoRA learns a unified generator of LoRA adapter weights by leveraging local activation information across multiple LLMs. The generator is designed to be permutation-symmetric so that it generalizes across models with different architectures and sizes. We optimize the model design for memory efficiency, making it feasible to train on large-scale LLMs with minimal computational resources. Experiments on LLM Trojan mitigation benchmarks demonstrate that TeleLoRA effectively reduces attack success rates while preserving the benign performance of the models.

Paper Structure

This paper contains 11 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: In model-specific alignment, where different LLMs require different alignment supervision, TeleLoRA enables synergy across seen LLMs and zero-shot alignment on unseen LLMs by learning a unified generator of LoRA adapter weights across different LLMs. In contrast, model-specific adapters cannot be learned for LLMs that lack alignment supervision, and model-agnostic adapters learned from the alignment supervision of other LLMs may not fit the current LLM.
  • Figure 2: A TeleLoRA module on a linear layer uses local activations under reference inputs to predict the weights of a multiplicative LoRA adapter for alignment. The network is invariant to permutations of the reference inputs (N) and the LoRA dimensions (R), and equivariant to permutations of the neurons (H).
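
The symmetry properties named in the Figure 2 caption can be illustrated with a toy generator. The following is a minimal sketch, not the paper's architecture: the pooling features (per-neuron mean and standard deviation), the DeepSets-style shared per-neuron network, the multiplicative form x(I + BAᵀ), and the names `make_generator`, `generate_adapter`, and `apply_multiplicative` are all assumptions made for illustration.

```python
import numpy as np

def make_generator(rank, hidden=16, seed=0):
    """Random weights for a toy generator (hypothetical; untrained)."""
    rng = np.random.default_rng(seed)
    # 4 input features per neuron: its own mean and std over reference
    # inputs, plus the layer-wide averages of those two statistics.
    w1 = rng.normal(scale=0.5, size=(4, hidden))
    w_a = rng.normal(scale=0.1, size=(hidden, rank))
    w_b = rng.normal(scale=0.1, size=(hidden, rank))
    return w1, w_a, w_b

def generate_adapter(acts, params):
    """Map activations of shape (N, H) to LoRA factors A, B of shape (H, R).

    The same weights are shared by every neuron, so the generator works
    for any N and any H.
    """
    w1, w_a, w_b = params
    # Pool over the N reference inputs: the result ignores their order.
    per_neuron = np.stack([acts.mean(axis=0), acts.std(axis=0)], axis=1)  # (H, 2)
    # Layer-wide context: pooled over neurons, then broadcast to each one.
    context = np.broadcast_to(per_neuron.mean(axis=0), per_neuron.shape)  # (H, 2)
    feats = np.concatenate([per_neuron, context], axis=1)                 # (H, 4)
    h = np.tanh(feats @ w1)                                               # (H, hidden)
    return h @ w_a, h @ w_b

def apply_multiplicative(x, a, b):
    """Assumed multiplicative adapter: x -> x (I + B A^T), with x of shape (..., H)."""
    return x + (x @ b) @ a.T

if __name__ == "__main__":
    params = make_generator(rank=4)
    rng = np.random.default_rng(1)
    acts = rng.normal(size=(32, 8))   # N=32 reference inputs, H=8 neurons
    a, b = generate_adapter(acts, params)

    perm = rng.permutation(8)
    a_p, b_p = generate_adapter(acts[:, perm], params)
    print("equivariant to neurons:", np.allclose(a_p, a[perm]) and np.allclose(b_p, b[perm]))

    a_s, b_s = generate_adapter(acts[rng.permutation(32)], params)
    print("invariant to reference order:", np.allclose(a_s, a) and np.allclose(b_s, b))
```

In this sketch, permuting the neurons permutes the rows of A and B in the same way, reordering the reference inputs leaves A and B unchanged, and permuting the R columns of A and B together leaves the product BAᵀ (and hence the adapted output) unchanged. Because the per-neuron weights are shared, the same generator can be called on layers of different widths, which is the property that would let one generator serve models of different sizes.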