Table of Contents
Fetching ...

Guiding ChatGPT to Generate Salient Domain Summaries

Jun Gao, Ziqiang Cao, Shaoyao Huang, Luozheng Qin, Chunhui Ai

TL;DR

The paper tackles the challenge of domain-salient, zero-shot summarization with ChatGPT by proposing PADS, a lightweight pipeline that combines in-context demonstration retrieval with a separate rank model. A dense retriever (SBERT) selects relevant demonstrations, which are used in multi-turn prompts to generate multiple candidate summaries; a Bart-large-based ranker then scores and selects the best summary conditioned on the input document. Training the ranker uses a two-phase approach: contrastive learning to distinguish high-quality continuations and a regression head trained on normalized ROUGE-L scores, with only 400M trainable parameters in the ranker and 2.5k data points per dataset. Empirical results across five diverse datasets show consistent gains over zero-shot ChatGPT and competitive baselines, with notable improvements on Gigaword (+8 ROUGE-L over zero-shot) and insights into retrieval choices, multi-turn prompting, and upper-bound analyses. The approach offers a practical, parameter-efficient path to improve domain-specific salience in LLM-generated summaries and highlights future directions in demonstration compression and latency management.

Abstract

ChatGPT is instruct-tuned to generate general and human-expected content to align with human preference through Reinforcement Learning from Human Feedback (RLHF), meanwhile resulting in generated responses not salient enough. Therefore, in this case, ChatGPT may fail to satisfy domain requirements in zero-shot settings, leading to poor ROUGE scores. Inspired by the In-Context Learning (ICL) and retelling ability of ChatGPT, this paper proposes PADS, a \textbf{P}ipeline for \textbf{A}ssisting ChatGPT in \textbf{D}omain \textbf{S}ummarization. PADS consists of a retriever to retrieve similar examples from corpora and a rank model to rerank the multiple candidate summaries generated by ChatGPT. Specifically, given an inference document, we first retrieve an in-context demonstration via the retriever. Then, we require ChatGPT to generate $k$ candidate summaries for the inference document at a time under the guidance of the retrieved demonstration. Finally, the rank model independently scores the $k$ candidate summaries according to their quality and selects the optimal one. We extensively explore dense and sparse retrieval methods to select effective demonstrations for reference and efficiently train the rank model to reflect the quality of candidate summaries for each given summarized document. Additionally, PADS contains merely 400M trainable parameters originating from the rank model and we merely collect 2.5k data to train it. We evaluate PADS on five datasets from different domains, and the result indicates that each module in PADS is committed to effectively guiding ChatGPT to generate salient summaries fitting different domain requirements. Specifically, in the popular summarization dataset Gigaword, PADS achieves over +8 gain on ROUGE-L, compared with the naive ChatGPT in the zero-shot setting. \footnote{Our code are available at \url{https://github.com/jungao1106/PADS}}

Guiding ChatGPT to Generate Salient Domain Summaries

TL;DR

The paper tackles the challenge of domain-salient, zero-shot summarization with ChatGPT by proposing PADS, a lightweight pipeline that combines in-context demonstration retrieval with a separate rank model. A dense retriever (SBERT) selects relevant demonstrations, which are used in multi-turn prompts to generate multiple candidate summaries; a Bart-large-based ranker then scores and selects the best summary conditioned on the input document. Training the ranker uses a two-phase approach: contrastive learning to distinguish high-quality continuations and a regression head trained on normalized ROUGE-L scores, with only 400M trainable parameters in the ranker and 2.5k data points per dataset. Empirical results across five diverse datasets show consistent gains over zero-shot ChatGPT and competitive baselines, with notable improvements on Gigaword (+8 ROUGE-L over zero-shot) and insights into retrieval choices, multi-turn prompting, and upper-bound analyses. The approach offers a practical, parameter-efficient path to improve domain-specific salience in LLM-generated summaries and highlights future directions in demonstration compression and latency management.

Abstract

ChatGPT is instruct-tuned to generate general and human-expected content to align with human preference through Reinforcement Learning from Human Feedback (RLHF), meanwhile resulting in generated responses not salient enough. Therefore, in this case, ChatGPT may fail to satisfy domain requirements in zero-shot settings, leading to poor ROUGE scores. Inspired by the In-Context Learning (ICL) and retelling ability of ChatGPT, this paper proposes PADS, a \textbf{P}ipeline for \textbf{A}ssisting ChatGPT in \textbf{D}omain \textbf{S}ummarization. PADS consists of a retriever to retrieve similar examples from corpora and a rank model to rerank the multiple candidate summaries generated by ChatGPT. Specifically, given an inference document, we first retrieve an in-context demonstration via the retriever. Then, we require ChatGPT to generate candidate summaries for the inference document at a time under the guidance of the retrieved demonstration. Finally, the rank model independently scores the candidate summaries according to their quality and selects the optimal one. We extensively explore dense and sparse retrieval methods to select effective demonstrations for reference and efficiently train the rank model to reflect the quality of candidate summaries for each given summarized document. Additionally, PADS contains merely 400M trainable parameters originating from the rank model and we merely collect 2.5k data to train it. We evaluate PADS on five datasets from different domains, and the result indicates that each module in PADS is committed to effectively guiding ChatGPT to generate salient summaries fitting different domain requirements. Specifically, in the popular summarization dataset Gigaword, PADS achieves over +8 gain on ROUGE-L, compared with the naive ChatGPT in the zero-shot setting. \footnote{Our code are available at \url{https://github.com/jungao1106/PADS}}
Paper Structure (34 sections, 3 equations, 3 figures, 6 tables)

This paper contains 34 sections, 3 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Workflow of PADS. Text is the user input to be summarized and corpora is the demonstrations to be retrieved. Reference and Inference are manually designed prompts in different conversation turns to distinguish inference documents from demonstrations. The optional format prompt is engaged in another conversation turns to correct the output format of ChatGPT if necessary.
  • Figure 2: The relative ROUGE scores difference of summaries with highest and lowest scores among five candidate summaries. The first candidate's scores serve as the baseline.
  • Figure 3: The relative ROUGE scores of ChatGPT Similar. The scores of ChatGPT Zero serve as the baseline.