Text2Grad: Reinforcement Learning from Natural Language Feedback

Hanyang Wang, Lu Wang, Chaoyun Zhang, Tianjun Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

TL;DR

Text2Grad addresses the limitations of scalar RLHF rewards by converting natural-language feedback into span-level rewards and token-level gradients through a three-stage pipeline: dual-feedback annotation, reward-model learning, and NL-Gradient policy optimization. It defines the Natural Language Gradient to ground updates in specific tokens, enabling precise credit assignment and interpretability across summarization, code generation, and open-domain QA. Empirical results show Text2Grad outperforming scalar-reward PPO and prompt-based baselines, with faster convergence and richer explanations of model behavior. The work demonstrates that natural-language feedback can serve as a principled training signal when grounded to tokens, offering a promising direction for more efficient and interpretable alignment of large language models.

Abstract

Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model's policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level rewards over the answer while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results demonstrate that natural-language feedback, when converted to gradients, is a powerful signal for fine-grained policy optimization. The code for our method is available at https://github.com/microsoft/Text2Grad.
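To make the span-to-token idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical annotation format of `(start, end, label)` token spans, an illustrative label-to-reward mapping (`LABEL_REWARD`), and a plain REINFORCE-style surrogate loss in place of the paper's actual policy optimizer. All function and variable names here are invented for illustration.

```python
from typing import List, Tuple

# Hypothetical span annotation: (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]

# Assumed label-to-reward mapping; the values are illustrative only.
LABEL_REWARD = {"good": 1.0, "bad": -1.0}


def spans_to_token_rewards(num_tokens: int, spans: List[Span]) -> List[float]:
    """Spread span-level labels onto tokens; unlabeled tokens get 0.0."""
    rewards = [0.0] * num_tokens
    for start, end, label in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"span ({start}, {end}) is out of range")
        for i in range(start, end):
            rewards[i] = LABEL_REWARD[label]
    return rewards


def token_weighted_loss(log_probs: List[float], weights: List[float]) -> float:
    """REINFORCE-style surrogate: -sum_t w_t * log pi(y_t).

    With a scalar reward every w_t is the same number; with token-level
    rewards each token's log-probability is pushed up or down on its own.
    """
    if len(log_probs) != len(weights):
        raise ValueError("log_probs and weights must have the same length")
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Toy response of five tokens: the first three are praised, the last is criticized.
tokens = ["The", "cat", "sat", "on", "Mars"]
spans = [(0, 3, "good"), (4, 5, "bad")]
token_rewards = spans_to_token_rewards(len(tokens), spans)

# Made-up per-token log-probabilities standing in for the policy's outputs.
log_probs = [-0.5] * len(tokens)
loss = token_weighted_loss(log_probs, token_rewards)
```

In this toy case the criticized token receives a negative weight while the praised span receives positive weights, so a gradient step on `loss` would adjust those tokens in opposite directions. A single scalar reward would instead move all five tokens the same way.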

Paper Structure

This paper contains 54 sections, 7 equations, 6 figures, 11 tables, 4 algorithms.

Figures (6)

  • Figure 1: Comparison of PPO and Text2Grad
  • Figure 2: An overview of Text2Grad.
  • Figure 3: Combined figure for SLF5K dataset analysis.
  • Figure 4: A case study from the code generation scenario comparing PPO vs. Text2Grad.
  • Figure 5: Comparative analysis of training dynamics between Text2Grad and standard PPO. The results demonstrate that Text2Grad (red line) achieves more stable and consistent learning progress, while standard PPO (blue line) shows significant volatility and unstable oscillations throughout the training process.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Definition 1: Natural Language Gradient