PSST: A Benchmark for Evaluation-driven Text Public-Speaking Style Transfer
Huashan Sun, Yixiao Wu, Yuhao Ye, Yizhe Yang, Yinghao Li, Jiawei Li, Yang Gao
TL;DR
This work introduces Public-Speaking Style Transfer (PSST), a task aimed at transforming long official texts into a public-speaking style. It grounds PSST in linguistic analysis, decomposes public-speaking style into four sub-styles (Interactivity, Emotionality, Vividness, and Orality), and develops a fine-grained evaluation framework to assess style strength and semantic preservation. The framework combines a passage-level scoring pipeline with a QA-based semantic check, enabling evaluation-driven improvements to LLM-based stylization. Experimental results reveal significant gaps in current LLMs, notably over-stylization, uneven distribution of style strength, and substantial semantic degradation on long texts, highlighting the need for better evaluation methods and model capabilities. The work also discusses limitations and proposes directions for expanding data domains, sub-style coverage, token-length handling, model diversity, and ethical safeguards.
Abstract
Language style is necessary for AI systems to understand and generate diverse human language accurately. However, previous work on text style transfer has primarily focused on sentence-level, data-driven approaches, which limits both the exploration of potential problems in large language models (LLMs) and the ability to meet complex application needs. To overcome these limitations, we introduce a novel task called Public-Speaking Style Transfer (PSST), which aims to simulate how humans transform passage-level official texts into a public-speaking style. Grounded in a linguistic analysis of real-world data, we decompose public-speaking style into key sub-styles to pose challenges and quantify the style modeling capability of LLMs. For such intricate text style transfer, we further propose a fine-grained evaluation framework to analyze the characteristics and identify the problems of stylized texts. Comprehensive experiments suggest that current LLMs struggle to generate public-speaking texts that align with human preferences, primarily due to excessive stylization and loss of semantic information.
