Auditable-choice reframing unlocks RL-based verification for open-ended tasks

Mengyu Zhang, Xubo Liu, Siyu Ding, Weichong Yin, Yu Sun, Hua Wu, Wenya Guo, Ying Zhang

TL;DR

This work addresses a core limitation of Reinforcement Learning with Verifiable Rewards (RLVR): its dependence on verifiers, which rules out open-ended tasks that lack ground-truth solutions. It introduces Verifiable Multiple-Choice Reformulation (VMR), which recasts open-ended supervision as verifiable binary-choice problems over a candidate set {y^+, y^-} with randomized option ordering, enabling RLVR-style optimization with a binary reward R*(y; y^+, y^-). Empirically, VMR-RLVR yields a 5.99-point average improvement across eight open-ended benchmarks on a 14B-scale model, with notable gains in creative writing and instruction following, and it improves reasoning quality (higher reasoning density) rather than simply increasing output length. The approach extends verifiable-reward training to a wider range of real-world tasks and offers a principled path toward stronger, more efficient reasoning in open-ended contexts.
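
One plausible way to write the binary reward named above is sketched here. The notation is an assumption for illustration and is not taken from the paper's equations: $\hat{a}(y)$ stands for the option the verifier extracts from the model output $y$, and $a^+$ stands for the label that $y^+$ receives after the options are randomly ordered.

$$
R^*(y;\, y^+, y^-) =
\begin{cases}
1 & \text{if } \hat{a}(y) = a^+ \\
0 & \text{otherwise}
\end{cases}
$$

Under this reading, the reward depends only on whether the selected option is the chosen answer, so it can be checked by a rule rather than by a learned reward model.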

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated great potential in enhancing the reasoning capabilities of large language models (LLMs), achieving remarkable progress in domains such as mathematics and programming where standard answers are available. However, for open-ended tasks lacking ground-truth solutions (e.g., creative writing and instruction following), existing studies typically regard them as non-reasoning scenarios, thereby overlooking the latent value of reasoning capabilities. This raises a key question: Can strengthening reasoning improve performance in open-ended tasks? To address this, we explore the transfer of the RLVR paradigm to the open domain. Yet, since RLVR fundamentally relies on verifiers that presuppose the existence of standard answers, it cannot be directly applied to open-ended tasks. To overcome this challenge, we introduce Verifiable Multiple-Choice Reformulation (VMR), a novel training strategy that restructures open-ended data into verifiable multiple-choice formats, enabling effective training even in the absence of explicit ground truth. Experimental results on multiple benchmarks validate the effectiveness of our method in improving LLM performance on open-ended tasks. Notably, across eight open-ended benchmarks, our VMR-based training delivers an average gain of 5.99 points over the baseline. Code will be released upon acceptance to facilitate reproducibility.

Paper Structure

This paper contains 14 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overall performance on eight open-ended benchmarks. Applying our proposed VMR method to the pairwise data consistently improves performance across benchmarks, even outperforming strong baselines I & II, which are driven by model-based rewards.
  • Figure 2: Rule-based RLVR ensures precise rewards but cannot handle open-ended tasks, while RM-based methods extend to such tasks at the cost of bias and reward hacking. Our VMR-based approach reformulates supervision into verifiable multiple-choice questions, combining RLVR’s rigor with broad open-ended applicability.
  • Figure 3: For each open-ended input, we construct a candidate set consisting of a chosen answer and a rejected answer. The two options are randomly ordered to form a multiple-choice question, and the model $\pi_\theta$ is tasked with selecting the correct one. A verifier then provides binary feedback, enabling RLVR-style optimization in open-ended domains without explicit ground-truth references.
  • Figure 4: Analysis of length and reasoning density.
  • Figure 5: UMAP visualization of embedding distributions. The first row shows results on ArenaHard2.0-CreativeWriting, while the second row shows results on CreativeWriting-V3.
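
The construction described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`build_vmr_question`, `extract_choice`, `binary_reward`), the prompt wording, and the `Answer: A` / `Answer: B` output convention are all hypothetical.

```python
import random
import re
from typing import Optional, Tuple


def build_vmr_question(prompt: str, chosen: str, rejected: str,
                       rng: random.Random) -> Tuple[str, str]:
    """Turn one (prompt, chosen, rejected) triple into a two-option question.

    Returns the question text and the label ("A" or "B") that the chosen
    answer ended up with after shuffling. The template is hypothetical.
    """
    options = [(chosen, True), (rejected, False)]
    rng.shuffle(options)  # randomized option ordering
    labels = ["A", "B"]
    correct = next(label for label, (_, is_chosen) in zip(labels, options)
                   if is_chosen)
    body = "\n\n".join(f"Option {label}:\n{text}"
                       for label, (text, _) in zip(labels, options))
    question = (
        f"Task:\n{prompt}\n\n{body}\n\n"
        "Reason step by step, then finish with 'Answer: A' or 'Answer: B'."
    )
    return question, correct


# Assumed output convention: the model ends with "Answer: A" or "Answer: B".
_ANSWER_RE = re.compile(r"Answer:\s*([AB])\b")


def extract_choice(output: str) -> Optional[str]:
    """Return the last stated choice in the model output, or None."""
    matches = _ANSWER_RE.findall(output)
    return matches[-1] if matches else None


def binary_reward(output: str, correct_label: str) -> float:
    """Rule-based verifier: 1.0 if the chosen answer was selected, else 0.0."""
    return 1.0 if extract_choice(output) == correct_label else 0.0
```

In an RLVR-style loop, `question` would be sent to the policy $\pi_\theta$ and `binary_reward` applied to each sampled output. Shuffling per example keeps the correct label from being predictable by position, and an output with no parseable choice receives zero reward.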