Reasoning-Enhanced Self-Training for Long-Form Personalized Text Generation

Alireza Salemi, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Weize Kong, Tao Chen, Zhuowan Li, Michael Bendersky, Hamed Zamani

TL;DR

The paper addresses long-form personalized text generation by training LLMs to reason over user context before responding. It introduces REST-PG, a multi-stage framework that first teaches the model to reason over personalized data and then applies Expectation-Maximization Reinforced Self-Training to align that reasoning with a reward reflecting user preferences. The method is evaluated on the LongLaMP benchmark. REST-PG yields significant gains over supervised fine-tuning and over self-training baselines without reasoning, and ablations show that reasoning and reinforcement-based self-training are complementary. The main cost is higher inference latency from the added reasoning steps, which points to efficiency improvements as future work.
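At the data level, "reasoning before responding" can look like the following minimal sketch. It builds a supervised example whose target is a reasoning path followed by the response, and it splits a generation back into those two parts. The prompt template, the `Reasoning:`/`Response:` markers, and all function names are hypothetical illustrations, not the paper's format. The sketch also leaves out how the reasoning paths themselves are produced.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical markers separating the reasoning path from the final answer.
REASONING_TAG = "Reasoning: "
RESPONSE_TAG = "\n\nResponse: "


@dataclass
class Example:
    prompt: str
    target: str


def build_prompt(task_input: str, profile: List[str]) -> str:
    """Hypothetical template: list the user's past documents, then the task."""
    history = "\n".join(f"- {item}" for item in profile)
    return f"User history:\n{history}\n\nTask: {task_input}\n"


def build_reasoning_example(
    task_input: str, profile: List[str], reasoning: str, response: str
) -> Example:
    """Training target = reasoning path first, then the personalized response."""
    target = f"{REASONING_TAG}{reasoning}{RESPONSE_TAG}{response}"
    return Example(prompt=build_prompt(task_input, profile), target=target)


def split_output(generated: str) -> Tuple[str, str]:
    """Split a generation into (reasoning, response); only the response goes to the user."""
    reasoning, sep, response = generated.partition(RESPONSE_TAG)
    if not sep:  # no response marker: treat the whole generation as the response
        return "", generated
    if reasoning.startswith(REASONING_TAG):
        reasoning = reasoning[len(REASONING_TAG):]
    return reasoning, response
```

The reasoning tokens are decoded before the response. They therefore add generation time, which is the latency cost noted above.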

Abstract

Personalized text generation requires large language models (LLMs) to learn from a kind of context they rarely encounter during standard training. One way to help LLMs use personalized context to produce outputs that better match the user's expectations is to instruct them to reason over the user's past preferences, background knowledge, or writing style. To achieve this, we propose Reasoning-Enhanced Self-Training for Personalized Text Generation (REST-PG), a framework that trains LLMs to reason over personal data during response generation. REST-PG first generates reasoning paths to train the LLM's reasoning abilities and then employs Expectation-Maximization Reinforced Self-Training to iteratively train the LLM on its own high-reward outputs. We evaluate REST-PG on the LongLaMP benchmark, which consists of four diverse personalized long-form text generation tasks. Our experiments demonstrate that REST-PG achieves significant improvements over state-of-the-art baselines, with an average relative performance gain of 14.5% on the benchmark.
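The self-training stage can be sketched as a generic ReST-EM-style loop. The sketch rests on several assumptions:

- `sample`, `reward`, and `finetune` are hypothetical callables standing in for the LLM, the reward model, and a training run.
- Outputs are filtered with a fixed reward threshold. The paper's actual selection rule may differ.
- `finetune` is assumed to start from a fixed checkpoint in every round.

The exploration budget `m` (outputs sampled per input) and the number of EM iterations correspond to the quantities varied in Figures 2 and 3.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
Pair = Tuple[str, str]  # (prompt, sampled output)


def rest_em(
    prompts: List[str],
    init_model: Model,
    sample: Callable[[Model, str], str],      # hypothetical: draw one output (reasoning + response)
    reward: Callable[[str, str], float],      # hypothetical: personalized reward-model score
    finetune: Callable[[List[Pair]], Model],  # hypothetical: train from a fixed checkpoint on pairs
    m: int = 4,                               # exploration budget: samples per prompt
    iterations: int = 3,                      # number of EM rounds
    threshold: float = 0.5,                   # assumed filtering rule: keep reward >= threshold
) -> Model:
    model = init_model
    for _ in range(iterations):
        # E-step: explore with the current model and keep only high-reward outputs.
        kept: List[Pair] = []
        for prompt in prompts:
            for _ in range(m):
                output = sample(model, prompt)
                if reward(prompt, output) >= threshold:
                    kept.append((prompt, output))
        if not kept:  # nothing passed the filter; stop rather than train on an empty set
            break
        # M-step: supervised fine-tuning on the model's own filtered outputs.
        model = finetune(kept)
    return model
```

Raising `m` gives the E-step more chances to find a high-reward output for hard inputs. The trade-off is proportionally more sampling cost per round.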

Paper Structure (40 sections, 15 figures, 3 tables)

Figures (15)

  • Figure 1: Overview of the training pipeline of Reasoning-Enhanced Self-Training for Personalized Text Generation (REST-PG).
  • Figure 2: Test-set performance of REST-PG with different exploration budgets ($m$) after one training iteration. The corresponding plot for the validation sets appears in the appendix.
  • Figure 3: The effect of the number of expectation-maximization steps on test-set performance. The corresponding plot for the validation sets appears in the appendix.
  • Figure 4: Relative test-set performance of REST-PG trained for one iteration starting from the base checkpoint versus the SFT checkpoint. The corresponding plot for the validation sets appears in the appendix.
  • Figure 5: The effect of randomly shuffling profiles on the reward model's scores.
  • ...and 10 more figures
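The profile-shuffling check in Figure 5 can be approximated with the sketch below. It assumes one plausible reading of the experiment, not the paper's stated protocol: "shuffling" means scoring each output against a different user's profile. If the reward model actually uses personal context, the mean score should drop. The function therefore returns the original mean minus the shuffled mean. The `reward` interface is hypothetical. The reassignment uses a random single-cycle permutation, so no output keeps its own profile.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence


def shuffled_profile_gap(
    profiles: Sequence[List[str]],
    outputs: Sequence[str],
    reward: Callable[[List[str], str], float],  # hypothetical reward-model interface
    seed: int = 0,
) -> float:
    """Mean reward with each user's own profile minus mean reward with another user's profile."""
    n = len(profiles)
    if n < 2 or len(outputs) != n:
        raise ValueError("need at least two users and one output per profile")
    rng = random.Random(seed)
    # Shuffle the user order, then map each user to the next one in that order.
    # The result is a single-cycle permutation, so no output is scored with its own profile.
    order = list(range(n))
    rng.shuffle(order)
    partner = [0] * n
    for i in range(n):
        partner[order[i]] = order[(i + 1) % n]
    original = mean(reward(profiles[i], outputs[i]) for i in range(n))
    shuffled = mean(reward(profiles[partner[i]], outputs[i]) for i in range(n))
    return original - shuffled
```

A gap near zero would suggest that the reward model largely ignores the profile. A large positive gap indicates that its scores depend on matching the output to the right user.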