
Style Transfer with Multi-iteration Preference Optimization

Shuai Liu, Jonathan May

TL;DR

STAMP tackles text style transfer in two stages: supervised fine-tuning (SFT) on pseudo-parallel data generated end to end from non-parallel data, followed by multi-iteration preference optimization (PO). Preference pairs are constructed with hope-and-fear sampling, and a dynamic, weighted reward aggregation balances three objectives: fluency $F$, meaning similarity $MS$, and target style strength $TSS$, formulated as $\mathcal{R} = TSS^{\alpha} \cdot MS^{\beta} \cdot F^{\gamma}$. A single unified transfer model with style-control codes is updated iteratively via contrastive PO (CPO) to progressively improve transfer quality. On the CDS and GYAFC datasets, STAMP achieves state-of-the-art results on automatic metrics and competitive human judgments, supporting the value of multi-iteration PO and of the proposed data-generation and reward-balancing techniques. Limitations include repetitions and hallucinations, as well as task-dependent effectiveness of reward weighting, suggesting future work on more robust reward models and PO algorithms.
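To make the aggregated reward concrete, here is a minimal sketch of the weighted product $\mathcal{R} = TSS^{\alpha} \cdot MS^{\beta} \cdot F^{\gamma}$. The function name `aggregate_reward`, the assumption that all three scores lie in $[0, 1]$, and the fixed default exponents are illustrative choices, not taken from the paper; in particular, the dynamic scheme STAMP uses to set the weights is not described in this summary, so the exponents are treated as plain inputs.

```python
def aggregate_reward(tss: float, ms: float, f: float,
                     alpha: float = 1.0, beta: float = 1.0,
                     gamma: float = 1.0) -> float:
    """Weighted product R = TSS^alpha * MS^beta * F^gamma.

    Assumption (not from the paper): each score is already scaled to [0, 1].
    """
    for name, score in (("tss", tss), ("ms", ms), ("f", f)):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {score}")
    return (tss ** alpha) * (ms ** beta) * (f ** gamma)


if __name__ == "__main__":
    # A candidate that fails one objective entirely gets zero reward,
    # regardless of how well it does on the other two.
    print(aggregate_reward(0.9, 0.0, 0.9))  # prints 0.0
```

Because the objectives are multiplied rather than summed, a high score on one objective cannot compensate for a near-zero score on another, and raising an exponent above 1 penalizes low values of that objective more strongly.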

Abstract

Numerous recent techniques for text style transfer characterize their approaches as variants of reinforcement learning and preference optimization. In this work, we consider the relationship between these approaches and a class of optimization approaches developed primarily for (non-neural) statistical machine translation, formerly known as 'tuning'. Inspired by these techniques from the past, we improve upon established preference optimization approaches, incorporating multiple iterations of exploration and optimization, and choosing contrastive examples by following a 'hope' vs 'fear' sampling strategy. Cognizant of the difference between machine translation and style transfer, however, we further tailor our framework with a new pseudo-parallel generation method and a dynamic weighted reward aggregation method to tackle the lack of parallel data and the need for a multi-objective reward. We evaluate our model on two commonly used text style transfer datasets. Through automatic and human evaluation results, we show the effectiveness and the superiority of our model compared to state-of-the-art baselines.
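The 'hope' vs 'fear' selection of contrastive examples can be sketched as follows. This follows the classical formulation from statistical machine translation tuning, where the hope candidate maximizes model score plus reward and the fear candidate maximizes model score minus reward; the exact selection rule in STAMP may differ. The `Candidate` class, the `hope_fear_pair` function, and the assumption that model score and reward are on comparable scales are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    text: str
    model_score: float  # e.g. a length-normalized log-probability (assumed)
    reward: float       # aggregated reward for this candidate (assumed)


def hope_fear_pair(
    candidates: List[Candidate],
) -> Optional[Tuple[Candidate, Candidate]]:
    """Return (hope, fear), or None if no contrastive pair can be formed.

    hope: the model likes it AND it scores well (model_score + reward).
    fear: the model likes it BUT it scores badly (model_score - reward).
    """
    if len(candidates) < 2:
        return None
    hope = max(candidates, key=lambda c: c.model_score + c.reward)
    fear = max(candidates, key=lambda c: c.model_score - c.reward)
    if hope is fear:
        # One candidate dominates on model score; no useful contrast.
        return None
    return hope, fear


if __name__ == "__main__":
    samples = [
        Candidate("fluent and on-style", model_score=-0.2, reward=0.9),
        Candidate("likely but off-style", model_score=-0.1, reward=0.2),
        Candidate("good but unlikely", model_score=-1.5, reward=0.95),
    ]
    pair = hope_fear_pair(samples)
    if pair is not None:
        hope, fear = pair
        print("preferred:", hope.text)
        print("rejected: ", fear.text)
```

The resulting (preferred, rejected) pair is the kind of input a contrastive preference optimization step consumes. Both members are outputs the current model considers likely, so the update targets mistakes the model would actually make.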

Paper Structure (35 sections, 4 equations, 2 figures, 17 tables)

Figures (2)

  • Figure 1: An overview of STAMP, in which we first train a unified style transfer model using supervised fine-tuning on pseudo-parallel data generated from non-parallel data, and then further train the model using multi-iteration preference optimization on preference pairs constructed with hope-and-fear sampling.
  • Figure 2: The value of iterative CPO on performance in STAMP and STAMP with unweighted $\mathcal{R}$, shown on the CDS dataset (test split). Iteration 0 refers to the SFT model before PO.