TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA
Chanjoo Jung, Jaehyung Kim
TL;DR
TiTok addresses the challenge of transferring LoRA-based PEFT knowledge across heterogeneous backbones by introducing a token-level contrastive excess signal that identifies informative tokens in synthetic data generated by a source expert. It uses a two-stage filtering pipeline (sample filtering by mean excess, then token-level selection of the top k%) and a tokenizer-alignment mechanism to handle mismatched tokenizers, all without training an extra discriminator. Across the BBH, MMLU, and LaMP benchmarks, TiTok consistently surpasses the Vanilla, KD, and TransLoRA baselines, with average gains of up to about 8%, and remains robust under cross-family and external-data transfer. The approach offers a practical, data-efficient path to deploying LoRA-based knowledge transfer in multi-model ecosystems, with possible extensions to adaptive token thresholds and broader data sources.
Abstract
Large Language Models (LLMs) are widely applied in real-world scenarios, but fine-tuning them incurs significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs, but the adapted parameters depend on the base model and cannot be transferred across different backbones. One way to address this issue is knowledge distillation, but its effectiveness inherently depends on the training data. Recent work such as TransLoRA avoids this dependence by generating synthetic data, but it adds complexity because it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Experiments on three benchmarks across multiple transfer settings show that the proposed method is consistently effective, achieving average performance gains of 4 to 8% over baselines.
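
To make the filtering idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the contrastive excess of a token is the difference between its log-probability under the source model with LoRA and under the source model without LoRA; the abstract does not give the exact formula. The function names, the `keep_ratio` parameter, and the rule of keeping the highest-scoring fraction of samples are likewise hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def token_excess(logp_with_lora: Sequence[float],
                 logp_base: Sequence[float]) -> List[float]:
    """Per-token excess, ASSUMED here to be a log-probability difference:
    log p(token | source + LoRA) - log p(token | source without LoRA)."""
    if len(logp_with_lora) != len(logp_base):
        raise ValueError("log-prob sequences must have the same length")
    return [a - b for a, b in zip(logp_with_lora, logp_base)]


def filter_samples(excesses: List[List[float]], keep_ratio: float) -> List[int]:
    """Stage 1 (sample filtering): rank samples by mean excess and return
    the indices of the top `keep_ratio` fraction, in original order.
    Assumes every sample has at least one token."""
    means = [sum(e) / len(e) for e in excesses]
    n_keep = max(1, math.ceil(keep_ratio * len(excesses)))
    ranked = sorted(range(len(excesses)), key=lambda i: means[i], reverse=True)
    return sorted(ranked[:n_keep])


def top_k_token_mask(excess: List[float], k_percent: float) -> List[bool]:
    """Stage 2 (token selection): True for the top k% of tokens by excess.
    At least one token is always kept."""
    n_keep = max(1, math.ceil(len(excess) * k_percent / 100.0))
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(excess))]


# Toy usage with made-up log-probabilities for two synthetic samples.
with_lora = [[-0.5, -1.0, -0.3], [-2.0, -2.0]]
without_lora = [[-1.5, -1.0, -0.4], [-2.0, -1.9]]
excesses = [token_excess(a, b) for a, b in zip(with_lora, without_lora)]
kept = filter_samples(excesses, keep_ratio=0.5)
masks = {i: top_k_token_mask(excesses[i], k_percent=50) for i in kept}
```

In a setup like this, the resulting masks would typically restrict the target model's training loss to the selected tokens. That step, and the tokenizer alignment needed when source and target tokenizers differ, are outside this sketch.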
