"Yes, My LoRD." Guiding Language Model Extraction with Locality Reinforced Distillation

Zi Liang; Qingqing Ye; Yanyun Wang; Sen Zhang; Yaxin Xiao; Ronghua Li; Jianliang Xu; Haibo Hu

TL;DR

This paper addresses the risk that the capabilities of commercial LLMs can be extracted. It argues that traditional MLE-based model extraction is misaligned with how modern LLMs are trained via RLHF, and proposes Locality Reinforced Distillation (LoRD), a policy-gradient-style extraction method that leverages locality between a local model and the victim's generations to provide a learning signal, with a PPO-like regularization to stabilize training. The authors prove that LoRD's learning trajectory is consistent with LLM alignment and demonstrate improved query efficiency and watermark resistance, validated on domain-specific knowledge tasks and safety-alignment scenarios. Experiments show that a relatively small local model can closely mimic a 175B victim on many tasks, which underscores a significant practical risk and motivates defensive measures by LLM providers.

Abstract

Model extraction attacks (MEAs) on large language models (LLMs) have received increasing attention in recent research. However, existing attack methods typically adapt extraction strategies originally developed for deep neural networks (DNNs). They neglect the underlying inconsistency between the training tasks of MEA and LLM alignment, leading to suboptimal attack performance. To tackle this issue, we propose Locality Reinforced Distillation (LoRD), a novel model extraction algorithm specifically designed for LLMs. In particular, LoRD employs a newly defined policy-gradient-style training task that uses the responses of the victim model as the signal to guide the crafting of preferences for the local model. Theoretical analyses demonstrate that (I) the convergence procedure of LoRD in model extraction is consistent with the alignment procedure of LLMs, and (II) LoRD can reduce query complexity while mitigating watermark protection through our exploration-based stealing. Extensive experiments validate the superiority of our method in extracting various state-of-the-art commercial LLMs. Our code is available at https://github.com/liangzid/LoRD-MEA

Paper Structure (33 sections, 2 theorems, 23 equations, 11 figures, 9 tables, 1 algorithm)

Introduction
Background
Policy Gradient Models
Language Modeling
LoRD: Locality Reinforced Distillation
Overview
Design of Loss Functions
Theoretical Analysis
Consistency Analysis on Learning Tasks
Comparative Analysis on Model Stealing
Experiments
Settings
Stealing Domain-Specific Knowledge
Stealing Safety Alignments
Conclusion
...and 18 more sections

Key Result

Proposition 1

The learning procedure for LLMs' alignments is consistent with the stealing procedure of LoRD, i.e., they both attempt to maximize the difference between the probabilities of positive and negative samples. Conversely, they are inconsistent with either MLE or KD. In MLE, the objective is maximizing t
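As a rough illustration of the contrast the proposition draws, the toy sketch below takes one gradient-ascent step on a three-way categorical distribution under two objectives: the difference in log-probability between a positive and a negative sample, and the plain log-likelihood of a single label (MLE). This is not the paper's loss. The use of log-probabilities, the function names (`contrastive_step`, `mle_step`), the learning rate, and the three-response setup are all assumptions made for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_step(logits: List[float], pos: int, neg: int, lr: float = 0.5) -> List[float]:
    """One ascent step on log p[pos] - log p[neg] (toy objective, an assumption).

    With softmax probabilities, the gradient of this objective with respect
    to the logits is onehot(pos) - onehot(neg), so only two logits move.
    """
    new_logits = list(logits)
    new_logits[pos] += lr
    new_logits[neg] -= lr
    return new_logits


def mle_step(logits: List[float], label: int, lr: float = 0.5) -> List[float]:
    """One ascent step on log p[label]; the gradient is onehot(label) - p."""
    p = softmax(logits)
    return [
        z + lr * ((1.0 if i == label else 0.0) - p[i])
        for i, z in enumerate(logits)
    ]


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0]
    print("start      :", softmax(start))
    print("contrastive:", softmax(contrastive_step(start, pos=0, neg=1)))
    print("mle        :", softmax(mle_step(start, label=0)))
```

In this toy setting the contrastive step raises the positive response and lowers the negative one specifically, while the MLE step raises the label and lowers every other response by the same amount.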

Figures (11)

Figure 1: Comparison between vanilla MEAs on conventional DNNs (left) and MEAs on LLMs with alignments (right).
Figure 2: The stealing procedure of LoRD.
Figure 3: Determination of the positive and negative samples in LoRD. We sample $\mathbf{y}_{t-1}^{+}$ and $\mathbf{y}_{t-1}^{-}$ from $P_{\theta_{t-1}}(\cdot |\mathbf{x})$, and compute their conditional probabilities. The response with a higher probability increment on $\theta_{t}$ is selected as the positive sample.
Figure 4: Illustrations for the converging procedure of probability distributions regarding four methods, namely MLE (a), KD (b), RLHF (c), and LoRD (d). Arrows indicate the expected optimization direction. We mark the distribution dimensions learned with labels in blue, and employ pink and yellow components to indicate the probabilities of positive and negative tokens, respectively.
Figure 5: Spectrum of the fidelity and performance-up on extracting different downstream tasks.
...and 6 more figures
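The selection rule described in the Figure 3 caption can be sketched as follows. Two candidate responses sampled from the previous local model are scored under both the previous and the current parameters, and the one whose probability increased more is labeled positive. This is a minimal sketch, not the authors' implementation: measuring the increment as a difference of summed token log-probabilities, the tuple layout, and the function names are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical layout: (response text,
#                       token log-probs under theta_{t-1},
#                       token log-probs under theta_t)
Candidate = Tuple[str, Sequence[float], Sequence[float]]


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Log-probability of a whole response as the sum of its token log-probs."""
    return sum(token_logprobs)


def probability_increment(candidate: Candidate) -> float:
    """Change in log-probability from theta_{t-1} to theta_t (assumed measure)."""
    _, prev_logprobs, curr_logprobs = candidate
    return sequence_logprob(curr_logprobs) - sequence_logprob(prev_logprobs)


def split_positive_negative(a: Candidate, b: Candidate) -> Tuple[str, str]:
    """Return (positive, negative): the larger increment is the positive sample."""
    if probability_increment(a) >= probability_increment(b):
        return a[0], b[0]
    return b[0], a[0]


if __name__ == "__main__":
    cand_a = ("response A", [-1.0, -2.0], [-0.5, -1.0])  # increment +1.5
    cand_b = ("response B", [-1.0, -1.0], [-1.5, -1.0])  # increment -0.5
    positive, negative = split_positive_negative(cand_a, cand_b)
    print("positive:", positive)
    print("negative:", negative)
```

Ties are broken in favor of the first candidate here, which is an arbitrary choice for the sketch.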

Theorems & Definitions (2)

Proposition 1: Consistency in Stealing Procedure
Proposition 2: Equivalence when Converged