PSST: A Benchmark for Evaluation-driven Text Public-Speaking Style Transfer

Huashan Sun; Yixiao Wu; Yuhao Ye; Yizhe Yang; Yinghao Li; Jiawei Li; Yang Gao

TL;DR

This work introduces Public-Speaking Style Transfer (PSST), a task aimed at transforming long official texts into a public-speaking style. It grounds PSST in linguistic analysis, decomposes public-speaking style into four sub-styles (Interactivity, Emotionality, Vividness, and Orality), and develops a fine-grained evaluation framework to assess style strength and semantic preservation. The framework combines a passage-level scoring pipeline with a QA-based semantic check, enabling evaluation-driven improvements to LLM-based stylization. Experimental results reveal significant gaps in current LLMs, notably over-stylization, uneven strength distribution, and substantial semantic degradation on long texts, highlighting the need for better evaluation methods and model capabilities. The work also discusses limitations and proposes directions for expanding data domains, sub-style coverage, token-length handling, model diversity, and ethical safeguards.

Abstract

Language style is necessary for AI systems to understand and generate diverse human language accurately. However, previous work on text style transfer has primarily focused on sentence-level, data-driven approaches, which limits both the exploration of potential problems in large language models (LLMs) and the ability to meet complex application needs. To overcome these limitations, we introduce a novel task called Public-Speaking Style Transfer (PSST), which aims to simulate how humans transform passage-level official texts into a public-speaking style. Grounded in a linguistic analysis of real-world data, we decompose public-speaking style into key sub-styles to pose challenges and quantify the style modeling capability of LLMs. For such intricate text style transfer, we further propose a fine-grained evaluation framework to analyze the characteristics and identify the problems of stylized texts. Comprehensive experiments suggest that current LLMs struggle to generate public-speaking texts that align with human preferences, primarily due to excessive stylization and loss of semantic information.

Paper Structure (73 sections, 3 equations, 27 figures, 10 tables)

Introduction
Related Work
Definition of Text Style
Evaluation of Text Style Transfer
Public-Speaking Style Transfer
Task Formulation
Source Data
Prior Fine-grained Analysis
Key Features of Public-Speaking Style
Fine-grained Evaluation System
Style Strength Evaluation
Fine-grained Evaluation Modeling
Text-Level Style Evaluation
Correlation
Semantic Preservation
...and 58 more sections

Figures (27)

Figure 1: Illustration of Public-Speaking Style Transfer (PSST). An AI model is requested to present a written text, such as a popular science article, to audiences vividly and engagingly. The example generated by ChatGPT in the figure shows excessive stylization (highlighted in red).
Figure 2: Pipeline for establishing the evaluation system of the PSST task and for the experiments and analysis of LLMs. (1) The top depicts the process of establishing the evaluation system. Specifically, we begin with a comprehensive analysis of real-world data from a linguistic perspective and identify four key characteristics of public-speaking style. Subsequently, we employ GPT-3.5 to generate a sentence-level list-wise parallel corpus and train TinyLlama-1.1B as a scorer for each dimension. For semantic preservation, we utilize GPT-4 to generate QA pairs that focus on key information and logic in source texts. We then have a QA model answer these questions from the stylized text, using variations in model accuracy to evaluate semantic integrity. (2) The bottom presents the experiments and analysis of the PSST task for current LLMs.
Figure 3: Human annotation of features of real public-speaking data (Inter-Annotator Agreement: Krippendorff's $\alpha$ = 0.7773). "Interactivity", "Emotionality", "Filler words" and "Vividness" are notable features.
Figure 4: Radar plot of text-level style strength of passages ($800\pm200$ tokens) transferred by different LLMs.
Figure 5: Style strength distribution of passages ($800\pm200$ tokens) transferred by different LLMs in Interactivity and Orality.
...and 22 more figures
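
The QA-based semantic-preservation check described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `qa_accuracy`, `semantic_preservation`, `toy_answer`, and `CANDIDATES` are hypothetical, the toy keyword lookup stands in for a real QA model, the hand-written QA pairs stand in for GPT-4-generated ones, and measuring the "variation in model accuracy" as a simple difference is an assumption.

```python
from typing import Callable, Dict, List, Tuple

QAPair = Tuple[str, str]  # (question, gold answer)


def qa_accuracy(text: str, qa_pairs: List[QAPair],
                answer_fn: Callable[[str, str], str]) -> float:
    """Fraction of questions answered correctly when reading only `text`."""
    if not qa_pairs:
        raise ValueError("qa_pairs must not be empty")
    correct = sum(
        answer_fn(text, question).strip().lower() == gold.strip().lower()
        for question, gold in qa_pairs
    )
    return correct / len(qa_pairs)


def semantic_preservation(source: str, stylized: str, qa_pairs: List[QAPair],
                          answer_fn: Callable[[str, str], str]) -> Dict[str, float]:
    """Compare QA accuracy on the source text and on the stylized text.

    Assumption: the accuracy variation is taken as source minus stylized.
    """
    source_acc = qa_accuracy(source, qa_pairs, answer_fn)
    stylized_acc = qa_accuracy(stylized, qa_pairs, answer_fn)
    return {
        "source_acc": source_acc,
        "stylized_acc": stylized_acc,
        "drop": source_acc - stylized_acc,
    }


# Hypothetical stand-in for a QA model: return the first candidate answer
# that literally appears in the text, or "unknown" if none does.
CANDIDATES = {
    "What gas do plants absorb?": ["carbon dioxide", "oxygen"],
    "What do plants release?": ["oxygen", "nitrogen"],
}


def toy_answer(text: str, question: str) -> str:
    lowered = text.lower()
    for candidate in CANDIDATES.get(question, []):
        if candidate in lowered:
            return candidate
    return "unknown"


qa_pairs = [
    ("What gas do plants absorb?", "carbon dioxide"),
    ("What do plants release?", "oxygen"),
]
source = "Plants absorb carbon dioxide and release oxygen."
stylized = "Friends, imagine this: plants breathe in carbon dioxide all day long!"

result = semantic_preservation(source, stylized, qa_pairs, toy_answer)
print(result)
```

In this toy case the stylized text omits what plants release, so one of the two questions can no longer be answered from it and the accuracy drops. In a real setup, `answer_fn` would wrap an actual QA model, and exact string matching would likely be replaced by a more tolerant answer comparison.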
