Adapter-based Selective Knowledge Distillation for Federated Multi-domain Meeting Summarization

Xiachong Feng; Xiaocheng Feng; Xiyuan Du; Min-Yen Kan; Bing Qin

TL;DR

The paper develops an adapter-based summarization model in which a global adapter and a local adapter cooperate, so that fewer parameters are exchanged and communication costs fall. It also devises a selective knowledge distillation strategy that helps each client robustly model its own domain-specific data while still leveraging global parameters learned from non-IID data.

Abstract

Meeting summarization has emerged as a promising technique for providing users with condensed summaries. However, existing work has focused on training models on centralized data, neglecting real-world scenarios where meeting data are infeasible to collect centrally because of their sensitive nature. This gap motivates us to explore federated learning for meeting summarization. Two critical challenges impede progress. First, state-of-the-art summarizers are based on parameter-heavy pre-trained models, and exchanging such a model's parameters across clients imposes large bandwidth costs. Second, as real-world meeting data belong to various domains and are distributed across clients, they are not independent and identically distributed (non-IID). Because the IID assumption does not hold, the choice of suitable learning algorithms changes. To address these challenges, we propose Adapter-based Federated Selective Knowledge Distillation (AdaFedSelecKD) for training performant client models. Specifically, we develop an adapter-based summarization model where two adapters cooperatively facilitate learning using fewer parameters to reduce communication costs. Then, we devise a selective knowledge distillation strategy that assists clients in robustly handling domain-focused modelling on their own data while leveraging global parameters based on non-IID data. Extensive experiments on the QMSum benchmark show that AdaFedSelecKD achieves performance comparable to powerful centralized training methods and confirm its generalizability and robustness.

Paper Structure (38 sections, 8 equations, 10 figures, 6 tables, 1 algorithm)

Introduction
Preliminaries
Multi-domain Meeting Summarization Dataset
Task Definition
Federated Learning Framework
Methodology
Overview
Adapter-based Meeting Summarizer
Motivation
Backbone Model
Global-Local Adapters
Selective Knowledge Distillation Strategy
Motivation
Knowledge Distillation
Selective Strategy
...and 23 more sections

Figures (10)

Figure 1: The overall federated learning framework of multi-domain meeting summarization. In the concrete setting for this paper, there is one central server and three clients covering distinct domains: Academic, Committee and Product. Each client uniquely maintains its own domain-specific data.
Figure 2: Illustration of our proposed AdaFedSelecKD learning framework. The overall framework adheres to a client-server learning paradigm. At the bottom, three clients are depicted, where each client adopts the selective knowledge distillation algorithm to optimize its own adapter-based meeting summarizer using its domain-specific private data. Two types of adapters are tailored for the information exchange between the server and clients: the global adapter and the local adapter. The optimized parameters from the three clients are then conveyed to the central server. At the top, the central server employs the federated averaging algorithm to aggregate client information. The resulting new parameters are distributed to the clients for the subsequent learning round.
Figure 3: Illustration of the adapter architecture. Two types of adapters are added between transformer layers, including the global adapter and the local adapter. Both adapters share the same architecture, comprising a down-projection feed-forward layer, a non-linear activation function, an up-projection feed-forward layer and a residual connection module equipped with layer normalization. The global adapter receives parameters from the server and provides global knowledge, whereas the local adapter is co-optimized through training on the local data and distilling knowledge from the global adapter. The updated parameters are then transmitted to the server for the next round of learning.
Figure 4: Comparison of meeting summaries generated by AdaFedSelecKD and by other methods on 60 randomly chosen meetings. For example, compared with AdaFedSelecKD, AdaFedAvg performs better on 4 of the 60 summaries and worse on 52.
Figure 5: Average ROUGE results under the IID and balanced data setting, where each client maintains meeting summarization data of the same distribution (IID) and holds the same number of data instances (balanced).
...and 5 more figures
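
The adapter structure in the Figure 3 caption and the server-side aggregation in the Figure 2 caption can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The ReLU activation, the placement of layer normalization after the residual sum, the omission of learned scale and shift in the normalization, the weighting of clients by data size, and all names, dimensions and client sizes are assumptions made for illustration only.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance (no learned scale/shift).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matvec(w, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def init_adapter(d_model, d_bottleneck, seed):
    # Hypothetical initializer: small random weights for the two projections.
    rng = random.Random(seed)
    return {
        "down": [[rng.gauss(0, 0.02) for _ in range(d_model)]
                 for _ in range(d_bottleneck)],
        "up": [[rng.gauss(0, 0.02) for _ in range(d_bottleneck)]
               for _ in range(d_model)],
    }


def adapter_forward(params, h):
    z = matvec(params["down"], h)      # down-projection feed-forward layer
    z = [max(0.0, v) for v in z]       # non-linear activation (ReLU assumed)
    z = matvec(params["up"], z)        # up-projection feed-forward layer
    # Residual connection followed by layer normalization (order assumed).
    return layer_norm([a + b for a, b in zip(h, z)])


def fed_avg(client_params, client_sizes):
    # Federated averaging over adapter weights only, weighted by the
    # number of training examples per client (weighting scheme assumed).
    total = sum(client_sizes)
    weights = [n / total for n in client_sizes]
    averaged = {}
    for name in client_params[0]:
        rows = len(client_params[0][name])
        cols = len(client_params[0][name][0])
        averaged[name] = [
            [sum(w * p[name][i][j] for w, p in zip(weights, client_params))
             for j in range(cols)]
            for i in range(rows)
        ]
    return averaged


# Three clients (e.g. Academic, Committee, Product) with hypothetical data sizes.
clients = [init_adapter(d_model=8, d_bottleneck=2, seed=s) for s in range(3)]
sizes = [100, 200, 300]
global_adapter = fed_avg(clients, sizes)
out = adapter_forward(global_adapter, [0.1 * i for i in range(8)])
```

Only the two small projection matrices are averaged here, which mirrors the communication argument in the abstract: the backbone model never leaves the client, so the exchanged payload scales with the adapter size rather than with the full pre-trained model.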
