Toward Responsible Federated Large Language Models: Leveraging a Safety Filter and Constitutional AI

Eunchung Noh, Jeonghun Baek

TL;DR

This paper addresses safety in federated fine-tuning of large language models (FedLLM), where client data containing harmful content can produce unsafe local models that are then aggregated into an unsafe global model. It incorporates two responsible-AI methods into FedLLM: a server-side safety filter (LG3) and Constitutional AI (CAI) with a cost-efficient training regime. Both are designed to work within FedLLM's privacy-preserving, parameter-efficient update framework based on LoRA. Results on AdvBench, HHH, and MT-Bench show enhanced helpfulness and improved safety, including a gain of over 20% on AdvBench, with the CAI approach providing substantial gains and the safety filter reducing unsafe data at the source. The work establishes a foundation for responsible FedLLM and points toward future multimodal extensions, balancing safety with computational practicality through the cost-efficient CAI strategy.

Abstract

Recent research has increasingly focused on training large language models (LLMs) using federated learning, known as FedLLM. However, responsible AI (RAI), which aims to ensure safe responses, remains underexplored in the context of FedLLM. In FedLLM, client data used for training may contain harmful content, leading to unsafe LLMs that generate harmful responses. Aggregating such unsafe LLMs into the global model and distributing them to clients may result in the widespread deployment of unsafe LLMs. To address this issue, we incorporate two well-known RAI methods into FedLLM: the safety filter and constitutional AI. Our experiments demonstrate that these methods significantly enhance the safety of the LLM, achieving over a 20% improvement on AdvBench, a benchmark for evaluating safety performance.

Paper Structure

This paper contains 14 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: For the responsible federated large language model (FedLLM), we apply a safety filter and Constitutional AI (CAI) to improve safety.
  • Figure 2: Overview of FedLLM. $W$ represents the model weights. The pretrained LLM ($W_P$) remains frozen, and only the local LoRA weights ($W_L$) are finetuned.
  • Figure 3: For CAI, red and self-revised responses are collected over three conversation turns.
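
The setup in the Figure 2 caption, where the pretrained weights ($W_P$) stay frozen and only the local LoRA weights ($W_L$) are trained and exchanged, can be illustrated with a minimal server-side aggregation sketch. It assumes a sample-weighted average in the style of FedAvg, which this summary does not specify. The names `LoRAWeights` and `aggregate_lora`, and the flattened-list representation of each LoRA matrix, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: LoRA parameter name -> flattened values.
# The frozen pretrained weights W_P never appear here; only W_L is exchanged.
LoRAWeights = Dict[str, List[float]]


def aggregate_lora(client_updates: List[Tuple[LoRAWeights, int]]) -> LoRAWeights:
    """Sample-weighted average of client LoRA weights (assumed FedAvg-style)."""
    if not client_updates:
        raise ValueError("need at least one client update")

    total_samples = sum(num_samples for _, num_samples in client_updates)
    if total_samples <= 0:
        raise ValueError("total number of samples must be positive")

    # Start from zeros with the same shape as the first client's LoRA weights.
    first_weights, _ = client_updates[0]
    aggregated: LoRAWeights = {
        name: [0.0] * len(values) for name, values in first_weights.items()
    }

    # Each client contributes in proportion to its local sample count.
    for weights, num_samples in client_updates:
        share = num_samples / total_samples
        for name, values in weights.items():
            for i, value in enumerate(values):
                aggregated[name][i] += share * value

    return aggregated


if __name__ == "__main__":
    client_a = ({"lora_A": [1.0, 2.0]}, 1)  # 1 local sample
    client_b = ({"lora_A": [3.0, 4.0]}, 3)  # 3 local samples
    print(aggregate_lora([client_a, client_b]))
```

Because only the aggregated LoRA weights are sent back to clients, an unsafe local update influences every client after aggregation, which is the risk the safety filter and CAI are meant to reduce.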