Flexible and Adaptable Summarization via Expertise Separation

Xiuying Chen; Mingzhe Li; Shen Gao; Xin Cheng; Qingqing Zhu; Rui Yan; Xin Gao; Xiangliang Zhang

TL;DR

MoeSumm addresses the need for a single, parameter-efficient model that is flexible on in-domain summarization and adaptable to out-of-domain tasks. It introduces a Mixture-of-Experts architecture with a shared main expert for general summarization and dataset-aware deputy experts, together with a max-margin loss that encourages a distinct separation of general and specialized abilities. The approach improves over baselines across 11 datasets, adapts in zero-shot and few-shot settings, and is competitive with GPT-3.5 while remaining parameter-efficient. The work shows that separating general and domain-specific summarization skills enables rapid adaptation to new domains with limited data and consistent performance across diverse tasks, which is of practical value for cross-domain summarization systems.

Abstract

A proficient summarization model should exhibit both flexibility (the capacity to handle a range of in-domain summarization tasks) and adaptability (the competence to acquire new knowledge and adjust to unseen out-of-domain tasks). Unlike large language models (LLMs) that achieve this through parameter scaling, we propose a more parameter-efficient approach in this study. Our motivation rests on the principle that the general summarization ability to capture salient information can be shared across different tasks, while the domain-specific summarization abilities need to be distinct and tailored. Concretely, we propose MoeSumm, a Mixture-of-Expert Summarization architecture, which utilizes a main expert for gaining the general summarization capability and deputy experts that selectively collaborate to meet specific summarization task requirements. We further propose a max-margin loss to stimulate the separation of these abilities. Our model's distinct separation of general and domain-specific summarization abilities grants it notable flexibility and adaptability, all while maintaining parameter efficiency. MoeSumm achieves flexibility by managing summarization across multiple domains with a single model, utilizing a shared main expert and selected deputy experts. It exhibits adaptability by tailoring deputy experts to cater to out-of-domain few-shot and zero-shot scenarios. Experimental results on 11 datasets show the superiority of our model compared with recent baselines and LLMs. We also provide statistical and visual evidence of the distinct separation of the two abilities in MoeSumm (https://github.com/iriscxy/MoE_Summ).
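The division of labor described in the abstract (one shared main expert plus selectively used deputy experts) can be pictured with a toy sketch. This is not the authors' implementation: the class `ToyMoeLayer`, the per-dataset embedding, the top-1 routing, and the rule that adds the gated deputy output to the main output are all assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random


def linear(weights, x):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyMoeLayer:
    """Hypothetical layer: one shared main expert plus several deputy experts."""

    def __init__(self, dim, num_deputies, num_datasets, seed=0):
        rng = random.Random(seed)
        # Shared across all datasets (general summarization ability).
        self.main = random_matrix(rng, dim, dim)
        # Specialized experts (domain-specific abilities).
        self.deputies = [random_matrix(rng, dim, dim) for _ in range(num_deputies)]
        # Assumption: the selector sees the hidden state plus a dataset embedding.
        self.dataset_emb = random_matrix(rng, num_datasets, dim)
        self.selector = random_matrix(rng, num_deputies, 2 * dim)

    def forward(self, x, dataset_id):
        # Concatenate the hidden state with the dataset embedding.
        selector_input = list(x) + self.dataset_emb[dataset_id]
        gates = softmax(linear(self.selector, selector_input))
        # Assumption: top-1 routing, i.e. only the best-scoring deputy runs.
        chosen = max(range(len(gates)), key=gates.__getitem__)
        main_out = linear(self.main, x)
        deputy_out = linear(self.deputies[chosen], x)
        # Assumption: the gated deputy output is added to the main output.
        output = [m + gates[chosen] * d for m, d in zip(main_out, deputy_out)]
        return output, chosen, gates


layer = ToyMoeLayer(dim=4, num_deputies=3, num_datasets=2)
output, chosen, gates = layer.forward([0.5, -0.2, 0.1, 0.3], dataset_id=1)
print("chosen deputy:", chosen)
```

In this sketch, adapting to a low-resource dataset would amount to updating only `selector`, `dataset_emb`, and `deputies` while leaving `main` fixed, which mirrors the idea of tailoring deputy experts while sharing the general ability.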

Paper Structure (19 sections, 9 equations, 7 figures, 6 tables)

Introduction
Related Work
Background
The Proposed MoeSumm Model
Dataset-aware Expert Selector
Max-margin Loss
Adaptability of MoeSumm
Experiments
Dataset and Evaluation setting
Baselines
Implementation Details
Evaluation Metrics
Main Experimental Results
Comparison with GPT-3.5
Analysis and Discussion
...and 4 more sections

Figures (7)

Figure 1: Comparison of the existing summarization model and our MoeSumm model. Our MoeSumm consists of a main summarization expert and multiple deputy experts, which can be used for or quickly adapt to different datasets.
Figure 2: Training MoeSumm under different settings. (a) Training the expert selector and all experts on multiple high-resource datasets. (b) Fine-tuning only the expert selector and the deputy experts on low-resource datasets.
Figure 3: Examples illustrating the max-margin loss $\mathcal{L}_{m}$ in two scenarios. (a) $\mathcal{L}_{m}$ is small when the main expert performs well, where both $P^{\text{full}}_{\text{word}}$ and $P^{\text{main}}_{\text{word}}$ for the target word surpass those of the other candidates. (b) $\mathcal{L}_{m}$ is large when the main model cannot perform well. In this scenario, minimizing the max-margin loss maximizes the margin $m_t$, which prevents the main model from becoming overconfident and stimulates the deputy experts to learn to predict the correct target word.
Figure 4: Distribution of selected Deputy Experts (DE) associated with three different datasets.
Figure 5: Projection comparison between dataset attributes and their deputy expert utilization distribution.
...and 2 more figures
