StyleRec: A Benchmark Dataset for Prompt Recovery in Writing Style Transformation
Shenyang Liu, Yang Gao, Shaoyan Zhai, Liqiang Wang
TL;DR
This work tackles prompt recovery for writing style transformation prompts by introducing StyleRec, a high-quality benchmark dataset generated from YouTube transcripts and validated through meaning- and cycle-consistency checks. It systematically compares multiple recovery strategies (direct inference, jailbreak, chain-of-thought, LoRA-based fine-tuning, and a canonical-prompt fallback) across two LLMs, finding that one-shot prompting and fine-tuning often yield the strongest results, while traditional metrics may fail to capture semantic fidelity. The study highlights the limitations of BLEU and Exact Match for prompt-recovery evaluation and demonstrates that Sharpened Cosine Similarity and token-level F1 provide more reliable guidance in this setting. Overall, StyleRec enables targeted progress in prompt recovery for unrestricted input prompts and points to the need for improved evaluation metrics and scalable data-generation pipelines for broader generalization.
Abstract
Prompt recovery, the task of reconstructing prompts from the outputs of large language models (LLMs), has grown in importance as LLMs become ubiquitous. Most users access LLMs through APIs and have no access to internal model weights, relying only on outputs and logits, which complicates recovery. This paper explores a distinct prompt recovery task focused on reconstructing prompts for style transfer and rephrasing, rather than typical question-answering. We introduce a dataset created with LLM assistance, with quality ensured through multiple techniques, and test methods including zero-shot, few-shot, jailbreak, chain-of-thought, fine-tuning, and a novel canonical-prompt fallback for poorly performing cases. Our results show that one-shot prompting and fine-tuning yield the best outcomes, and they also expose flaws in traditional sentence similarity metrics for evaluating prompt recovery. Contributions include (1) a benchmark dataset, (2) comprehensive experiments on prompt recovery strategies, and (3) identification of limitations in current evaluation metrics, all of which advance general prompt recovery research, where the structure of the input prompt is unrestricted.
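To make the metric discussion concrete, the following is a minimal sketch of three of the scores named above, applied to a recovered prompt and its reference. It is not the paper's evaluation code. The whitespace tokenization, the lowercasing, the exponent of 3 in the sharpened cosine, and the example prompts are all assumptions made for illustration, and the vectors passed to `sharpened_cosine` stand in for sentence embeddings that a real evaluation would obtain from an embedding model.

```python
import math
from collections import Counter


def exact_match(pred: str, ref: str) -> bool:
    # Assumption: comparison is case-insensitive and ignores outer whitespace.
    return pred.strip().lower() == ref.strip().lower()


def token_f1(pred: str, ref: str) -> float:
    # Assumption: tokens are lowercase whitespace-separated words.
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return float(p == r)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def sharpened_cosine(u, v, exponent: float = 3.0) -> float:
    # Assumption: "sharpening" raises the cosine to a power (3 here),
    # keeping its sign, so that moderate similarities are pushed toward 0.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 0.0
    cos = dot / norm
    return math.copysign(abs(cos) ** exponent, cos)


# Hypothetical reference prompt and recovered prompt.
ref = "Rewrite the text in a formal tone"
pred = "Rewrite this text in a formal style"

print(exact_match(pred, ref))  # near-miss paraphrases get no credit
print(token_f1(pred, ref))     # partial credit for the 5 shared tokens
```

In this example the two prompts request nearly the same transformation, yet Exact Match scores them as a complete miss while token-level F1 gives partial credit, which illustrates the kind of gap between surface-form metrics and semantic fidelity that the abstract describes.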
