SERL: Self-Examining Reinforcement Learning on Open-Domain

Weixuan Ou, Yanzhao Zheng, Shuoshuo Sun, Wei Zhang, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Pengwei Yan, Yifan Qiao

TL;DR

SERL is a self-examining reinforcement learning framework in which a single LLM acts as both Actor and Judge, eliminating external reward models for open-domain tasks. It uses Copeland-style pairwise judgments to derive an intrinsic reward for the Actor and a self-consistency reward for the Judge, with a Length Control Module and Position Bias Mitigation to stabilize training. The approach achieves state-of-the-art results among self-improving methods and matches or approaches the performance of much larger models on summarization, open writing, and general QA. This work points to a practical path toward scalable self-improvement in open-domain NLP applications without external reward signals.

Abstract

Reinforcement Learning (RL) has been shown to improve the capabilities of large language models (LLMs). However, applying RL to open-domain tasks faces two key challenges: (1) the inherent subjectivity of these tasks rules out the verifiable rewards required by Reinforcement Learning with Verifiable Rewards (RLVR); (2) Reinforcement Learning from Human Feedback (RLHF) relies on external reward mechanisms. To overcome these limitations, we propose Self-Examining Reinforcement Learning (SERL), a novel self-improving framework in which the LLM serves as both Actor and Judge. SERL introduces two synergistic reward mechanisms that require no external signals. First, to improve the Actor's capability, we derive rewards from Copeland-style pairwise comparison judgments across a group of generated responses. Second, to improve the Judge's reliability, we propose a self-consistency reward that encourages coherent judgments. This process refines the Judge's capability, which in turn provides a more robust reward for the Actor. Experiments show that our method outperforms existing self-improvement training methods. SERL improves the LC win rate of Qwen3-8B on AlpacaEval 2 from 52.37% to 59.90%. To the best of our knowledge, our method achieves state-of-the-art performance among self-improving approaches. Furthermore, it achieves performance comparable to significantly larger models such as Qwen3-32B, demonstrating its effectiveness and robustness on open-domain tasks.

Paper Structure

This paper contains 52 sections, 14 equations, 11 figures, 21 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of SERL. Given an instruction, the Actor first samples a group of responses. The Judge then samples pairwise comparison judgments between response pairs. The judgments are aggregated via the Copeland method to yield the Reward for Actor. Next, the consistency between the judgments and the ranking implied by the Reward for Actor is computed to generate the Reward for Judge. This process jointly enhances generation ability and comparative evaluation ability.
  • Figure 2: Illustration of the Copeland method.
  • Figure 3: Win rate against Qwen3-8B on summarization and open writing, and LC win rate on general QA, as a function of SERL training step.
  • Figure 4: Consistency of evaluation results across different evaluators on the summarization task. (a) SERL vs. Qwen3-32B. (b) SERL vs. the base model Qwen3-8B.
  • Figure 5: Comparison of average output length changes between the complete method and the method without length control mechanism during summarization training.
  • ...and 6 more figures
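
The reward flow described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the function names `copeland_scores` and `judge_consistency` are hypothetical, the Copeland score is taken in its standard form (pairwise wins minus pairwise losses), and the Judge reward is assumed to be 1.0 when a judgment agrees with the ranking implied by the Actor scores and 0.0 otherwise. The exact normalization used by SERL, the Length Control Module, and Position Bias Mitigation are not given in this summary and are not modeled here.

```python
def copeland_scores(n, judgments):
    """Aggregate pairwise judgments into one score per response.

    `judgments` maps a pair (i, j) to the index of the preferred response
    (i or j), or None for a tie. Standard Copeland scoring is assumed:
    +1 for each pairwise win, -1 for each pairwise loss.
    """
    scores = [0] * n
    for (i, j), winner in judgments.items():
        if winner is None:
            continue  # a tie changes neither score
        loser = j if winner == i else i
        scores[winner] += 1
        scores[loser] -= 1
    return scores


def judge_consistency(scores, judgments):
    """Score each judgment by agreement with the ranking implied by `scores`.

    Assumption (not from the source): reward 1.0 if the judged winner is the
    higher-scored response, else 0.0. When the two scores are equal, only a
    tie judgment (None) counts as consistent.
    """
    rewards = {}
    for (i, j), winner in judgments.items():
        if scores[i] == scores[j]:
            implied = None
        else:
            implied = i if scores[i] > scores[j] else j
        rewards[(i, j)] = 1.0 if winner == implied else 0.0
    return rewards


# Three sampled responses; the Judge prefers 0 over 1, 0 over 2, and 2 over 1.
judgments = {(0, 1): 0, (0, 2): 0, (1, 2): 2}
actor_rewards = copeland_scores(3, judgments)   # [2, -2, 0]
judge_rewards = judge_consistency(actor_rewards, judgments)
print(actor_rewards, judge_rewards)
```

In this toy case every judgment agrees with the aggregate ranking, so each receives a Judge reward of 1.0. A judgment that contradicted the ranking, such as preferring response 1 over response 2, would receive 0.0 under the assumed rule.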