LLM-based Rewriting of Inappropriate Argumentation using Reinforcement Learning from Machine Feedback

Timon Ziegenbein; Gabriella Skitalinskaya; Alireza Bayat Makou; Henning Wachsmuth

TL;DR

This work addresses the automatic mitigation of inappropriate argumentation in online discourse. It introduces an RLHF-inspired rewriting framework that bootstraps an initial policy by prompting instruction-tuned LLMs and then refines it with PPO, using a reward that combines semantic similarity and appropriateness, with a KL penalty that keeps the policy close to the prompt-derived baseline. The reward is defined as $r(x,\hat{y}) = \alpha \cdot c_{sim}(x,\hat{y}) + (1-\alpha) \cdot c_{app}(\hat{y})$, and the overall objective uses $R(x,\hat{y}) = r(x,\hat{y}) - \beta \log\left( \frac{\pi^{RL}_{\phi}(\hat{y}|x)}{\pi^{PRT}(\hat{y}|x)} \right)$, optimized with PPO. Key findings show that document-level rewriting on non-parallel data can reduce inappropriateness while largely preserving content, and that the LLaMA-based policy with instruction tuning and PPO scores higher in automatic and manual evaluation than several baselines. The work provides practical insights into reward design for RLHF in non-parallel settings and highlights reader-centric preferences for appropriateness, with implications for proactive moderation.
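The two formulas in the TL;DR can be made concrete with a minimal sketch. It assumes that $c_{sim}$ and $c_{app}$ are already available as scalar classifier scores and that the two policies expose sequence-level log-probabilities for the rewrite; the function names, argument names, and numeric values below are hypothetical and are not taken from the paper's implementation.

```python
def combined_reward(c_sim: float, c_app: float, alpha: float) -> float:
    """r(x, y_hat) = alpha * c_sim(x, y_hat) + (1 - alpha) * c_app(y_hat)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * c_sim + (1.0 - alpha) * c_app


def kl_penalized_reward(r: float, logp_rl: float, logp_prt: float, beta: float) -> float:
    """R(x, y_hat) = r(x, y_hat) - beta * log(pi_RL(y_hat|x) / pi_PRT(y_hat|x)).

    The log of the probability ratio is computed as a difference of
    log-probabilities, which avoids underflow for long sequences.
    """
    return r - beta * (logp_rl - logp_prt)


# Hypothetical scores for one rewrite (illustrative values only).
r = combined_reward(c_sim=0.9, c_app=0.6, alpha=0.5)
R = kl_penalized_reward(r, logp_rl=-12.0, logp_prt=-12.5, beta=0.1)
print(f"r = {r:.3f}, R = {R:.3f}")
```

Setting `alpha` closer to 1 favors content preservation, while values closer to 0 favor appropriateness. A larger `beta` penalizes rewrites that the RL policy finds much more likely than the prompted initial policy does.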

Abstract

Ensuring that online discussions are civil and productive is a major challenge for social media platforms. Such platforms usually rely both on users and on automated detection tools to flag inappropriate arguments of other users, which moderators then review. However, this kind of post-hoc moderation is expensive and time-consuming, and moderators are often overwhelmed by the amount and severity of flagged content. Instead, a promising alternative is to prevent negative behavior during content creation. This paper studies how inappropriate language in arguments can be computationally mitigated. We propose a reinforcement learning-based rewriting approach that balances content preservation and appropriateness based on existing classifiers, prompting an instruction-finetuned large language model (LLM) as our initial policy. Unlike related style transfer tasks, rewriting inappropriate arguments allows deleting and adding content permanently. It is therefore tackled on document level rather than sentence level. We evaluate different weighting schemes for the reward function in both absolute and relative human assessment studies. Systematic experiments on non-parallel data provide evidence that our approach can mitigate the inappropriateness of arguments while largely preserving their content. It significantly outperforms competitive baselines, including few-shot learning, prompting, and humans.

Paper Structure (33 sections, 3 equations, 7 figures, 9 tables)

Introduction
Related Work
Proximal Policy Optimization in NLP
Approach
Problem Formulation
Prompting as an Initial Policy
Reward Modeling and Policy Learning
Data
Source Data
Extension
Experiments
Experimental Setup
Finding an Initial Policy
Creating Few-Shot Examples
Prompting Setup
...and 18 more sections

Figures (7)

Figure 1: Example of an inappropriate argument from the corpus of Ziegenbein et al. (2023) and the same argument after applying our approach. Colors indicate which parts of the original argument were removed (red strikethrough) and which parts were added by our approach during rewriting (green).
Figure 2: Our approach to rewriting inappropriate arguments: The policy $\pi^{RL}$ is optimized using PPO to generate an improved version $\hat{y}$ from the input argument $x$ while preserving the content of $x$ as much as possible ($c_{sim}$) and making the argument more appropriate ($c_{app}$). Optimization is based on the reward $R$, which combines the weighted sum $r$ of the scalar classifier outputs with the KL divergence between the initial policy $\pi^{PRT}$ and the current policy $\pi^{RL}$. Dashed lines: The probability distribution over the tokens is used as the output of the LLM.
Figure 3: Visual representation of the employed sampling strategies for six rewrite instances. Subfigure (a) illustrates all pairwise comparisons, while subfigures (b, c, d) depict S-Window sampling at $\lambda$ values of 2, 3, and 4, respectively. Each subfigure comprises a matrix, where grey cells indicate sampled comparisons between a pair of rewrites ($R_i$ and $R_j$), and an accompanying graph, where the edges indicate sampled pairwise comparisons.
...and 4 more figures
