Style Transfer with Multi-iteration Preference Optimization
Shuai Liu, Jonathan May
TL;DR
STAMP tackles text style transfer in two stages: supervised fine-tuning on pseudo-parallel data generated end to end, followed by multi-iteration preference optimization (PO). It selects contrastive training pairs with a hope-and-fear sampling strategy and uses a dynamic, weighted reward aggregation to balance three objectives: fluency $F$, meaning similarity $MS$, and target style strength $TSS$, combined as $\mathcal{R} = TSS^{\alpha} \cdot MS^{\beta} \cdot F^{\gamma}$. A single unified transfer model, conditioned on style-control codes, is updated iteratively via contrastive PO to progressively improve transfer quality. On CDS and GYAFC, STAMP achieves state-of-the-art results on automatic metrics and competitive human judgments, supporting the effectiveness of multi-iteration PO and of the proposed data-generation and reward-balancing techniques. Limitations include repetitions and hallucinations, as well as task-dependent effectiveness of the reward weighting, which suggests future work on more robust reward models and PO algorithms.
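As a rough illustration of the reward formula above, the sketch below computes $\mathcal{R} = TSS^{\alpha} \cdot MS^{\beta} \cdot F^{\gamma}$ for a single candidate output. The function name `aggregate_reward`, the assumption that each score lies in $[0, 1]$, and the default weights of 1.0 are illustrative choices, not taken from the paper. How the weights are set dynamically is not described in this summary, so they are plain arguments here.

```python
def aggregate_reward(tss: float, ms: float, f: float,
                     alpha: float = 1.0, beta: float = 1.0,
                     gamma: float = 1.0) -> float:
    """Weighted product of the three objectives: TSS^alpha * MS^beta * F^gamma.

    Assumption (not from the paper): every score is normalized to [0, 1],
    so a larger exponent penalizes a low score on that objective more.
    """
    for name, value in (("tss", tss), ("ms", ms), ("f", f)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (tss ** alpha) * (ms ** beta) * (f ** gamma)


# Hypothetical scores for one candidate transfer.
uniform = aggregate_reward(tss=0.5, ms=0.5, f=0.5)
style_heavy = aggregate_reward(tss=0.5, ms=0.5, f=0.5, alpha=2.0)
```

Because the aggregation is a product, a candidate that scores near zero on any one objective receives a near-zero reward regardless of the other two, which is one way to read the need for balancing the exponents.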
Abstract
Numerous recent techniques for text style transfer characterize their approaches as variants of reinforcement learning and preference optimization. In this work, we consider the relationship between these approaches and a class of optimization approaches developed primarily for (non-neural) statistical machine translation, formerly known as 'tuning'. Inspired by these earlier techniques, we improve upon established preference optimization approaches by incorporating multiple iterations of exploration and optimization and by choosing contrastive examples with a 'hope' vs. 'fear' sampling strategy. Cognizant of the differences between machine translation and style transfer, we further tailor our framework with a new pseudo-parallel data generation method and a dynamic weighted reward aggregation method, which address the lack of parallel data and the need for a multi-objective reward, respectively. We evaluate our model on two commonly used text style transfer datasets. Through automatic and human evaluation, we show the effectiveness of our model and its superiority over state-of-the-art baselines.
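The 'hope' vs. 'fear' selection mentioned above can be sketched as follows. This is a minimal illustration that assumes the classic formulation from statistical machine translation tuning: the hope candidate maximizes model score plus reward, and the fear candidate maximizes model score minus reward. The paper's exact selection rule may differ. The `Candidate` class, the `select_hope_fear` function, and the `scale` parameter are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    log_prob: float  # model score of the sampled output
    reward: float    # aggregated reward for the output


def select_hope_fear(candidates: list[Candidate],
                     scale: float = 1.0) -> tuple[Candidate, Candidate]:
    """Pick a (hope, fear) pair from sampled outputs.

    Assumed rule (SMT-style, not confirmed by the source):
      hope = argmax(log_prob + scale * reward)  -> likely and good
      fear = argmax(log_prob - scale * reward)  -> likely but bad
    `scale` is a hypothetical knob to put the two terms on a similar range.
    Note that hope and fear can coincide if one candidate dominates on
    log_prob; callers may want to skip such pairs.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    hope = max(candidates, key=lambda c: c.log_prob + scale * c.reward)
    fear = max(candidates, key=lambda c: c.log_prob - scale * c.reward)
    return hope, fear


# Hypothetical samples for one source sentence.
samples = [
    Candidate("candidate A", log_prob=-1.0, reward=0.9),
    Candidate("candidate B", log_prob=-0.5, reward=0.1),
    Candidate("candidate C", log_prob=-3.0, reward=0.95),
]
hope, fear = select_hope_fear(samples)
```

The resulting pair would then serve as the preferred and dispreferred examples for one preference optimization update, with new samples drawn from the updated model in the next iteration.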
