
Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO

Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen

TL;DR

This paper examines the security of decentralised GRPO used for post-training LLMs. It shows that adversaries can poison benign nodes by injecting malicious tokens into the completions exchanged during all-gather, reaching an attack success rate (ASR) of up to 100% in as few as 50 iterations. It formalises two decentralised regimes (vertical and horizontal) and two model settings (homogeneous and heterogeneous), and demonstrates two attack families (out-of-context and in-context) on math and coding tasks, including a 2+2=5 equation manipulation and a code-injection scenario. The authors propose two defenses, token-generation checks for the homogeneous setting and LLM-as-a-judge for the heterogeneous setting, that can deter attacks under certain conditions, though each introduces trade-offs in detection coverage and learning efficiency. The work highlights critical security risks in low-communication decentralised RL and provides a foundation for future robust defense strategies, including enhanced judge-based verification and subliminal learning to mitigate covert poisoning effects.

Abstract

Group Relative Policy Optimization (GRPO) has proven highly useful in post-training of Large Language Models (LLMs). In GRPO, the model answers prompts and, through reinforcement learning, learns the preferred completions. Owing to its small communication volume, GRPO is inherently suited to decentralised training: multiple nodes can answer the prompts concurrently and then exchange the completions in the form of strings. In this work, we present the first adversarial attack on decentralised GRPO. We demonstrate that malicious parties can poison such systems by injecting arbitrary malicious tokens into benign models in both out-of-context and in-context attacks. Using empirical examples of math and coding tasks, we show that adversarial attacks can easily poison the benign nodes, polluting their local LLM post-training and achieving attack success rates of up to 100% in as few as 50 iterations. We propose two ways to defend against these attacks, depending on whether all users train the same model or different models. We show that these defenses can achieve stop rates of up to 100%, making the attack impossible.
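
To make the attack surface concrete, the following is a minimal sketch of the group-relative advantage commonly used in GRPO, where each completion's reward is normalised against the other completions for the same prompt. It is not the authors' implementation: the function name, the reward values, and the assumption that the reward only scores the final answer are all illustrative. Under those assumptions, a completion received from a peer that scores well gets a positive advantage, so every token in it, including any injected ones, is reinforced.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one prompt's group of completions.

    Standard GRPO-style advantage: (r_i - mean) / (std + eps).
    `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for one prompt (assumed: 1.0 = correct final answer).
local_rewards = [0.0, 1.0, 0.0]   # completions generated by this node
received_rewards = [1.0]          # completion received from a peer as a string;
                                  # it may carry injected tokens and still score 1.0

advantages = group_relative_advantages(local_rewards + received_rewards)
print(advantages)
```

In this toy group the peer's completion ends up with an advantage of roughly +1, the same as the best local completion, which is why a reward that ignores the injected text cannot separate benign from poisoned completions.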

Paper Structure

This paper contains 38 sections, 2 equations, 15 figures, 1 table, 2 algorithms.

Figures (15)

  • Figure 1: Out-of-context attack results in decentralised RL (dRL) settings on Qwen-2.5 1.5B models trained on the GSM8K dataset. Both settings include 25% malicious users. Attack Success Rate (ASR) measures the fraction of the honest workers' completions on a validation dataset that contain the malicious text.
  • Figure 2: Example of a poisoned completion produced by an honest model.
  • Figure 3: An example of a poisoned completion produced by a benign model for an equation manipulation attack.
  • Figure 4: In-context equation manipulation attack results with 25% malicious participation.
  • Figure 5: ASR in horizontal Hail to the thief (HTTF) and vertical 2+2=5 with the LLM-as-a-judge defense.
  • ...and 10 more figures
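
The ASR described in the Figure 1 caption can be sketched as a simple ratio. This is an illustrative reading of the caption, not the paper's evaluation code: the function name, the use of exact substring matching, and the example strings (including the malicious phrase) are assumptions.

```python
def attack_success_rate(completions, malicious_text):
    """Fraction of honest-worker completions that contain the malicious text.

    Assumes plain substring matching; the paper's exact matching rule
    is not stated in this overview.
    """
    if not completions:
        return 0.0
    hits = sum(malicious_text in c for c in completions)
    return hits / len(completions)


# Hypothetical validation completions from honest workers.
validation_completions = [
    "All hail to the thief. 3 + 4 = 7, so the answer is 7.",
    "3 + 4 = 7, so the answer is 7.",
    "All hail to the thief. The total is 12.",
    "The total is 12.",
]

asr = attack_success_rate(validation_completions, "hail to the thief")
print(f"ASR = {asr:.0%}")
```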