LAGO: Few-shot Crosslingual Embedding Inversion Attacks via Language Similarity-Aware Graph Optimization

Wenrui Yu, Yiyi Chen, Johannes Bjerva, Sokol Kosta, Qiongxiu Li

TL;DR

This paper addresses the privacy risk of embedding inversion in multilingual NLP by introducing LAGO, a language similarity-aware graph optimization framework for few-shot cross-lingual inversion. It builds a topological graph over languages and enforces cross-language consistency through two optimization variants: hard linear inequality constraints and soft total variation penalties, with ALGEN recovered as a special case. Empirical results across multiple languages and victim models show that leveraging language similarity improves transferability by about 10–20% in Rouge-L scores, especially in extremely low-data regimes, and demonstrate robustness to the choice of similarity metric. The work highlights the need for privacy defenses that account for linguistic structure in multilingual embeddings and discusses differential privacy as a defense, noting a significant utility cost in cross-lingual settings.

Abstract

We propose LAGO (Language Similarity-Aware Graph Optimization), a novel approach for few-shot cross-lingual embedding inversion attacks that addresses critical privacy vulnerabilities in multilingual NLP systems. Unlike prior work on embedding inversion attacks, which treats languages independently, LAGO explicitly models linguistic relationships through a graph-based constrained distributed optimization framework. By integrating syntactic and lexical similarity as edge constraints, our method enables collaborative parameter learning across related languages. Theoretically, we show that this formulation generalizes prior approaches such as ALGEN, which emerges as a special case when the similarity constraints are relaxed. Our framework combines Frobenius-norm regularization with either linear inequality or total variation constraints, ensuring robust alignment of cross-lingual embedding spaces even with extremely limited data (as few as 10 samples per language). Extensive experiments across multiple languages and embedding models demonstrate that LAGO substantially improves the transferability of attacks, with a 10–20% increase in Rouge-L score over baselines. This work establishes language similarity as a critical factor in inversion attack transferability, urging renewed focus on language-aware privacy-preserving multilingual embeddings.
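To make the idea of graph-coupled alignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes each language i has a small set of paired embeddings (`X[i]`, `Y[i]`) and learns one linear map `W[i]` per language by least squares, with a Frobenius-norm ridge term and a squared-difference penalty on every graph edge. The squared edge penalty is a smooth stand-in chosen here for simplicity; it is not the exact total variation or inequality formulation of the paper. The names `fit_coupled_alignments`, `lam`, `ridge`, and `sweeps` are hypothetical.

```python
import numpy as np

def fit_coupled_alignments(X, Y, edges, lam=1.0, ridge=1e-3, sweeps=200):
    """Minimize, over one matrix W[i] per language,
        sum_i ||X[i] W[i] - Y[i]||_F^2 + ridge * ||W[i]||_F^2
        + lam * sum_{(i,j) in edges} ||W[i] - W[j]||_F^2
    by block coordinate descent: each W[i] is updated in closed form
    while its neighbours are held fixed."""
    n = len(X)
    d_in, d_out = X[0].shape[1], Y[0].shape[1]

    # Undirected adjacency list built from the edge list.
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    W = [np.zeros((d_in, d_out)) for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            deg = len(neighbors[i])
            # Normal equations of the objective restricted to W[i].
            A = X[i].T @ X[i] + (ridge + lam * deg) * np.eye(d_in)
            pull = sum((W[j] for j in neighbors[i]), np.zeros((d_in, d_out)))
            B = X[i].T @ Y[i] + lam * pull
            W[i] = np.linalg.solve(A, B)
    return W

# Illustrative usage with random data: 3 languages, 10 samples each.
rng = np.random.default_rng(0)
X = [rng.normal(size=(10, 4)) for _ in range(3)]
Y = [rng.normal(size=(10, 3)) for _ in range(3)]
W = fit_coupled_alignments(X, Y, edges=[(0, 1), (1, 2)], lam=1.0)
```

With `lam=0` the edge term vanishes and each language is fitted independently by ridge regression, which mirrors, in this simplified setting, the abstract's remark that the uncoupled approach arises when the similarity constraints are relaxed.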

Paper Structure

This paper contains 33 sections, 16 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Few-shot Cross-lingual Textual Embedding Inversion Leveraging Language Similarities. Example: Attack model trained on English embeddings is used to attack embeddings in other languages, using language similarities as a prior.
  • Figure 2: Illustration of LAGO vs. ALGEN (Chen et al., 2025). Top: ALGEN treats each language independently. Bottom: LAGO leverages language similarity by introducing edge constraints in a joint distributed optimization framework.
  • Figure 3: Example graphs using two Language Similarities: (a) AJSP model with $r=0.9$; (b) Lang2vec model with $r=0.45$.
  • Figure 4: Cross-lingual Inversion Performance with the AJSP Graph, Measured in Cosine Similarity, across Numbers of Training Samples.
  • Figure 5: Cross-lingual Inversion Performance with the AJSP Graph, Measured in Rouge-L Score, across Numbers of Training Samples.
  • ...and 11 more figures
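
The Figure 3 caption describes language graphs parameterized by a value $r$. One plausible reading, stated here as an assumption rather than the paper's definition, is that $r$ acts as a cutoff on a pairwise language score, with an edge drawn between two languages whose score passes the cutoff. A minimal sketch under that assumption follows; the function name, the `connect_if_below` switch, and all scores are made up for illustration and are not taken from the paper.

```python
from itertools import combinations

def build_language_graph(langs, score, r, connect_if_below=False):
    """Return undirected edges between language pairs whose score passes cutoff r.

    With connect_if_below=False the score is treated as a similarity (edge if >= r);
    with connect_if_below=True it is treated as a distance (edge if <= r)."""
    edges = []
    for a, b in combinations(langs, 2):
        s = score[frozenset((a, b))]
        keep = s <= r if connect_if_below else s >= r
        if keep:
            edges.append((a, b))
    return edges

# Hypothetical scores, invented purely for illustration.
langs = ["eng", "deu", "fra", "fin"]
score = {
    frozenset(("eng", "deu")): 0.80,
    frozenset(("eng", "fra")): 0.60,
    frozenset(("eng", "fin")): 0.10,
    frozenset(("deu", "fra")): 0.50,
    frozenset(("deu", "fin")): 0.20,
    frozenset(("fra", "fin")): 0.15,
}
edges = build_language_graph(langs, score, r=0.55)
```

Raising or lowering `r` changes how dense the graph is, which in turn determines how many languages share information during the joint optimization.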