Secure Linear Alignment of Large Language Models

Matt Gorbett, Suman Jana

Abstract

Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently trained models introduces new opportunities for cross-model alignment to downstream objectives. Moreover, it unlocks new potential application domains, such as settings where security, privacy, or competitive constraints prohibit direct data or model sharing. In this work, we propose a privacy-preserving framework that exploits representational convergence to enable cross-silo inference between independent language models. The framework learns an affine transformation over a shared public dataset and applies homomorphic encryption to protect client queries during inference. By encrypting only the linear alignment and classification operations, the method achieves sub-second inference latency while maintaining strong security guarantees. We support this framework with an empirical investigation into representational convergence, in which we learn linear transformations between the final hidden states of independent models. We evaluate these cross-model mappings on embedding classification and out-of-distribution detection, observing minimal performance degradation across model pairs. Additionally, we show for the first time that linear alignment sometimes enables text generation across independently trained models.
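As a rough illustration of the affine alignment step described above, the following sketch fits a map $z_B \mapsto z_B W + b$ by ordinary least squares on paired embeddings of the same public inputs. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `fit_affine_map`, the small `ridge` stabilizer, and the synthetic data are all hypothetical, and the estimator actually used in the paper is not shown in this section.

```python
import numpy as np

def fit_affine_map(Z_src, Z_tgt, ridge=1e-6):
    """Least-squares fit of Z_tgt ~= Z_src @ W + b.

    Z_src: (n, d_src) source-model embeddings of the shared public inputs.
    Z_tgt: (n, d_tgt) target-model embeddings of the same inputs.
    ridge: small diagonal term for numerical stability (an assumption).
    """
    mu_src = Z_src.mean(axis=0)
    mu_tgt = Z_tgt.mean(axis=0)
    Xc = Z_src - mu_src  # centering lets the bias be solved in closed form
    Yc = Z_tgt - mu_tgt
    d = Xc.shape[1]
    # Normal equations: (Xc^T Xc + ridge * I) W = Xc^T Yc
    W = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(d), Xc.T @ Yc)
    b = mu_tgt - mu_src @ W
    return W, b

# Synthetic stand-ins for two models' embeddings (hypothetical data).
rng = np.random.default_rng(0)
Z_B = rng.normal(size=(500, 16))
W_true = rng.normal(size=(16, 24))
b_true = rng.normal(size=24)
Z_A = Z_B @ W_true + b_true + 0.01 * rng.normal(size=(500, 24))

W, b = fit_affine_map(Z_B, Z_A)
max_err = np.abs(Z_B @ W + b - Z_A).max()
```

Note that the source and target dimensions may differ, since $W$ has shape $(d_{\text{src}}, d_{\text{tgt}})$; the only requirement in this sketch is that both models embed the same inputs in the same order.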

Paper Structure (70 sections, 11 equations, 13 figures, 12 tables, 2 algorithms)

Figures (13)

  • Figure 1: Text Generation via Cross-Model Linear Alignment: We learn an affine map from Qwen’s hidden states into Llama’s feature space, enabling Qwen representations to be decoded by Llama’s token head. The resulting hybrid model combines Qwen’s encoder/transformer blocks with Llama’s output head, producing coherent responses without adopting either model’s identity.
  • Figure 2: Linear CKA similarity across embedding APIs. We compute linear CKA [kornblith2019similarity] on vendor-provided embeddings over shared inputs from multiple datasets. CKA values range from 0.595 to 0.881, indicating substantial shared linear structure across independently trained models.
  • Figure 3: Cross-Model Embedding Similarity to Native Models: We compare cross-model text generation to the text produced by each base model using cosine similarity (computed with OpenAI's embedding-001). Each point represents an Alpaca test prompt. High-similarity pairs (upper right) produce coherent text, while low-similarity pairs (lower left) generate incoherent outputs.
  • Figure 4: Exact Token Match Rate Predicts Cross-Model Generation Quality. Exact token match rate between two models predicts cross-model text generation quality across 23 model pairs. Quality is measured by LLM-as-a-Judge Scores ($r=0.898$, $p<0.001$).
  • Figure 5: Two-party privacy-preserving alignment and inference. Training: Party B (client) encrypts embeddings $Z_B = g_B(\mathcal{D}_{\text{pub}})$ and sends $\mathsf{Enc}(Z_B)$ to Party A (provider), who computes the encrypted cross-covariance $\mathsf{Enc}(Z_A^\top Z_B)$ using plaintext $Z_A = g_A(\mathcal{D}_{\text{pub}})$ and returns $\mathsf{Enc}(Z_A^\top Z_B)$ to Party B. Party B decrypts $Z_A^\top Z_B$ and computes $W^*$ locally using Eq. (\ref{eq:normal_eq}). Inference: Party B computes aligned embedding $\hat{z}_A = z_B \cdot W^* + b^*$ locally, encrypts $\mathsf{Enc}_I(\hat{z}_A)$, and sends to Party A, who applies the classifier homomorphically and returns the encrypted prediction for Party B to decrypt.
  • ...and 8 more figures
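
The linear CKA score named in the Figure 2 caption can be sketched in a few lines. This follows the standard feature-space formula for linear CKA, $\|Y^\top X\|_F^2 / (\|X^\top X\|_F \, \|Y^\top Y\|_F)$ on column-centered matrices; the function name `linear_cka` and the random inputs are hypothetical, and the paper's own preprocessing of the vendor embeddings is not shown here, so treat this as an assumption-laden illustration rather than the evaluation code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two embedding matrices over the same n inputs.

    X: (n, d1), Y: (n, d2). The embedding widths d1 and d2 may differ.
    """
    X = X - X.mean(axis=0)  # center each feature column
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Hypothetical data: one embedding matrix, a rotated and rescaled copy,
# and an unrelated matrix of a different width.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
cka_same = linear_cka(X, X)
cka_rotated = linear_cka(X, 2.0 * X @ Q)
cka_unrelated = linear_cka(X, rng.normal(size=(200, 12)))
```

Because the score is invariant to orthogonal transformations and isotropic scaling, the rotated and rescaled copy scores the same as the original, while the unrelated matrix scores much lower.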