FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom

Yuanqin He, Yan Kang, Lixin Fan, Qiang Yang

TL;DR

This work addresses the privacy-sensitive problem of evaluating LLMs in Federated Learning (FL) by proposing FedEval-LLM, a framework in which participants' personalized evaluation models act as referees and reach a verdict through collective voting. To train these personalized evaluators, it introduces bootstrapped, task-specific evaluation data and three data-quality criteria, avoiding the limitations of labeled test sets and external evaluators. In experiments on an eight-client FL setup with LLaMA-7B, the evaluators show substantial gains in evaluation capability (measured by r_v and Acc_v), and with multiple referees their judgments align strongly with human preferences and RougeL, which highlights the importance of task-specific domain knowledge. The framework removes the reliance on external services, strengthens privacy, and provides task-aligned evaluation suitable for monitoring global and local model performance in collaborative LLM development.

Abstract

Federated Learning (FL) has emerged as a promising solution for collaborative training of large language models (LLMs). However, the integration of LLMs into FL introduces new challenges, particularly concerning the evaluation of LLMs. Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers, thereby failing to accurately reflect the performance of LLMs on generative tasks. Meanwhile, although automatic evaluation methods that leverage advanced LLMs present potential, they face critical risks of data leakage due to the need to transmit data to external servers and suboptimal performance on downstream tasks due to the lack of domain knowledge. To address these issues, we propose a Federated Evaluation framework of Large Language Models, named FedEval-LLM, that provides reliable performance measurements of LLMs on downstream tasks without the reliance on labeled test sets and external tools, thus ensuring strong privacy-preserving capability. FedEval-LLM leverages a consortium of personalized LLMs from participants as referees to provide domain knowledge and collective evaluation capability, thus aligning to the respective downstream tasks and mitigating uncertainties and biases associated with a single referee. Experimental results demonstrate a significant improvement in the evaluation capability of personalized evaluation models on downstream tasks. When applied to FL, these evaluation models exhibit strong agreement with human preference and RougeL-score on meticulously curated test sets. FedEval-LLM effectively overcomes the limitations of traditional metrics and the reliance on external services, making it a promising framework for the evaluation of LLMs within collaborative training scenarios.

Paper Structure

This paper contains 19 sections, 8 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Overview of the proposed FedEval-LLM framework, which has two key steps: (1) training of personalized evaluation models (left) and (2) collective evaluation using these evaluation models (right). Left: With well-trained local models, $M^i_{local}$, participating clients build a task-specific evaluation dataset $D_{eval}$ from a question-only dataset $X_{test}$ using a bootstrapping strategy. The resulting evaluation dataset serves as an approximation of the task-specific evaluation criteria $\mathcal{E}_T$ and is used to train a personalized evaluation model for each client, $M^i_{eval}$. Here, $(q, a)$ and $e$ denote a question-answer pair and its corresponding evaluation. Right: In the training phase, a group of clients acts collectively as referees, providing a reliable evaluation of the global model, $M^G$, reported as its win rate against the reference models, $M^{ref}$. A detailed description of the framework is given in Section 3.
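
The collective evaluation step on the right of Figure 1 can be illustrated with a minimal Python sketch. It assumes each referee returns one of three verdicts for a pair of answers and that verdicts are combined by simple majority vote. The verdict labels, the tie handling, and every function name below are hypothetical choices made for illustration, not the paper's implementation, and the toy referees merely stand in for the clients' personalized evaluation models $M^i_{eval}$.

```python
from collections import Counter
from typing import Callable, Sequence

# A referee compares the global model's answer with a reference model's answer
# to the same question and returns "global", "reference", or "tie".
# (Hypothetical interface: the real referees are personalized evaluation LLMs.)
Referee = Callable[[str, str, str], str]


def collective_verdict(votes: Sequence[str]) -> str:
    """Combine referee votes by majority; an even split at the top is a tie."""
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def win_rate(
    questions: Sequence[str],
    global_answers: Sequence[str],
    reference_answers: Sequence[str],
    referees: Sequence[Referee],
) -> float:
    """Fraction of questions on which the referees collectively prefer the global model."""
    if not questions:
        raise ValueError("at least one question is required")
    wins = 0
    for q, a_global, a_ref in zip(questions, global_answers, reference_answers):
        votes = [referee(q, a_global, a_ref) for referee in referees]
        if collective_verdict(votes) == "global":
            wins += 1
    return wins / len(questions)


# Toy referees used only to exercise the voting logic.
def prefers_longer(q: str, a_global: str, a_ref: str) -> str:
    if len(a_global) == len(a_ref):
        return "tie"
    return "global" if len(a_global) > len(a_ref) else "reference"


def always_global(q: str, a_global: str, a_ref: str) -> str:
    return "global"


def always_reference(q: str, a_global: str, a_ref: str) -> str:
    return "reference"
```

Using an odd number of referees makes a split vote less likely. The sketch counts a collective tie as a non-win, which is one possible convention among several.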