FedGRPO: Privately Optimizing Foundation Models with Group-Relative Rewards from Domain Client

Gongxi Zhu, Hanlin Gu, Lixin Fan, Qiang Yang, Yuxing Han

TL;DR

FedGRPO reframes server-side foundation-model refinement as a privacy-preserving reward-evaluation process. It combines competence-based expert selection with Group Relative Policy Optimization to aggregate scalar rewards from selected clients, eliminating the need to share data or high-dimensional updates. Empirically, FedGRPO achieves superior downstream accuracy and markedly lower communication overhead compared with FedFMs baselines, and it approaches centralized GRPO performance even when ground-truth answers are unavailable. This approach enables scalable, privacy-conscious collaboration across heterogeneous clients, offering practical impact for deploying Federated Foundation Models in real-world, data-sensitive domains.

Abstract

One important direction of Federated Foundation Models (FedFMs) is leveraging data from small client models to enhance the performance of a large server-side foundation model. Existing methods based on model-level or representation-level knowledge transfer either require expensive local training or incur high communication costs, and both introduce unavoidable privacy risks. We reformulate this problem as a reinforcement-learning-style evaluation process and propose FedGRPO, a privacy-preserving framework comprising two modules. The first module performs competence-based expert selection by building a lightweight confidence graph from auxiliary data to identify the most suitable clients for each question. The second module leverages the "Group Relative" concept from the Group Relative Policy Optimization (GRPO) framework: it packages each question together with its solution rationale into candidate policies, dispatches these policies to a selected subset of expert clients, and aggregates only the resulting scalar reward signals via a federated group-relative loss function. By exchanging reward values instead of data or model updates, FedGRPO reduces privacy risk and communication overhead while enabling parallel evaluation across heterogeneous devices. Empirical results on diverse domain tasks demonstrate that FedGRPO achieves superior downstream accuracy and communication efficiency compared to conventional FedFMs baselines.
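The two modules described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the competence scores stand in for whatever the confidence graph produces, averaging the selected clients' rewards is an assumed aggregation rule, and the function names (`select_experts`, `aggregate_client_rewards`, `group_relative_advantages`) are hypothetical. Only the final normalization, reward minus group mean divided by group standard deviation, follows the standard GRPO advantage.

```python
import statistics
from typing import Dict, List


def select_experts(competence: Dict[str, float], top_k: int) -> List[str]:
    """Pick the top_k clients by a (hypothetical) per-question competence score."""
    ranked = sorted(competence, key=competence.get, reverse=True)
    return ranked[:top_k]


def aggregate_client_rewards(
    client_rewards: Dict[str, List[float]], experts: List[str]
) -> List[float]:
    """Average, per candidate answer, the scalar rewards sent by the selected experts.

    Assumption: plain averaging; the paper's aggregation rule may differ.
    """
    n_candidates = len(client_rewards[experts[0]])
    return [
        statistics.fmean([client_rewards[c][i] for c in experts])
        for i in range(n_candidates)
    ]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style normalization within one group of candidates."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy example: three clients, four candidate answers to one question.
competence = {"a": 0.9, "b": 0.2, "c": 0.7}
client_rewards = {
    "a": [1.0, 0.0, 1.0, 0.0],
    "b": [0.0, 0.0, 0.0, 0.0],
    "c": [1.0, 0.0, 0.0, 0.0],
}
experts = select_experts(competence, top_k=2)
rewards = aggregate_client_rewards(client_rewards, experts)
advantages = group_relative_advantages(rewards)
```

In this sketch only one scalar per candidate travels from each client to the server, which is the property the abstract credits for the reduced communication overhead and privacy risk.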

Paper Structure (37 sections, 11 equations, 4 figures, 3 tables, 1 algorithm)

This paper contains 37 sections, 11 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of FedGRPO, which comprises three steps: 1) expert selection, which chooses an appropriate expert subset $\mathcal{C}(x_{s})$ for every question $x_{s}$; 2) dual evaluation on each selected client $k \in \mathcal{C}(x_{s})$ to compute rewards $r_k^s$ for the policy $\hat{y}$; and 3) group-relative reward aggregation on the server to obtain the group-relative reward $R_{k}$ and perform policy optimization to update the LLM.
  • Figure 2: The communication overhead of FedGRPO, FedPETuning, and DPSDA-FL.
  • Figure 3: Accuracy of FedGRPO on AMC, Olympiad, and all 6 test sets (averaged) across varying client numbers.
  • Figure 4: Accuracy of FedGRPO on AMC and Olympiad, and averaged accuracy on 6 test sets (MATH500, Minerva, OlympiadBench, AIME 2024, AIME 2025, and AMC), for different numbers of selected experts.

Theorems & Definitions (2)

  • Remark 1
  • Remark 2