VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training -- A Chess Case Study

Zhicheng Zhang; Ziyan Wang; Yali Du; Fei Fang

TL;DR

Verbalized Action Masking (VAM) states an action mask in the prompt and requires the model to output an action from the allowed set. The paper presents verbalized masking as a practical mechanism for controllable exploration in LLM RL post-training.

Abstract

Exploration remains a key bottleneck for reinforcement learning (RL) post-training of large language models (LLMs), where sparse feedback and large action spaces can lead to premature collapse into repetitive behaviors. We propose Verbalized Action Masking (VAM), which verbalizes an action mask in the prompt and enforces that the model outputs an action from the masked set. Building on this interface, we introduce iterative action-space pruning: if the target action is not sampled, we remove valid sampled actions from the mask and resample under the reduced candidate set, repeating until the target is sampled or a fixed budget is exhausted. We study VAM in chess and evaluate it under two training regimes: an engine-play regime that generates states via play against an engine opponent and a fixed-dataset regime that trains from a fixed dataset of positions with verifier scores. Across held-out chess puzzles and full-game play measured by average centipawn loss (ACPL), VAM improves learning efficiency and final performance over strong baselines, highlighting verbalized masking as a practical mechanism for controllable exploration in LLM RL post-training.
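The iterative action-space pruning loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt template in `build_prompt`, the `sample_fn` callable (a stand-in for LLM sampling plus output parsing), and the parameters `group_size` and `max_rounds` are all assumptions introduced here. Outputs that fall outside the allowed set are simply ignored when pruning; how the paper handles them is not stated in this section.

```python
from typing import Callable, List, Sequence, Tuple


def build_prompt(state: str, allowed: Sequence[str]) -> str:
    """Verbalize the mask by listing the allowed actions in the prompt.

    Hypothetical template; the paper's actual wording is not shown here.
    """
    return (
        f"State: {state}\n"
        f"Allowed actions: {', '.join(allowed)}\n"
        "Reply with exactly one allowed action."
    )


def iterative_pruning(
    state: str,
    legal_actions: Sequence[str],
    target: str,
    sample_fn: Callable[[str], str],  # assumed stand-in for LLM sampling + parsing
    group_size: int = 4,              # assumed rollouts per round
    max_rounds: int = 3,              # assumed fixed budget
) -> List[Tuple[List[str], List[str]]]:
    """Return one (allowed_set, sampled_actions) pair per pruning round."""
    allowed = list(legal_actions)
    rounds: List[Tuple[List[str], List[str]]] = []
    for _ in range(max_rounds):
        prompt = build_prompt(state, allowed)
        sampled = [sample_fn(prompt) for _ in range(group_size)]
        rounds.append((list(allowed), sampled))
        if target in sampled:
            break  # target sampled: stop pruning
        # Remove the distinct valid sampled actions from the mask.
        valid = {a for a in sampled if a in allowed}
        allowed = [a for a in allowed if a not in valid]
        if not allowed:
            break  # nothing left to sample from
    return rounds
```

Each returned pair corresponds to one rollout group sampled under one mask. Because only sampled actions are removed and the loop stops once the target appears, the target stays in the allowed set for every round, provided it was in `legal_actions` to begin with.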

Paper Structure

This paper contains 58 sections, 10 equations, 8 figures, 1 table, and 2 algorithms.

Introduction
Related Work
RL with verifiable rewards (RLVR)
Exploration in RL post-training
Chess-playing LLMs
Preliminaries
Chess notation and legal moves
Group Relative Policy Optimization (GRPO)
Methods
Action-masking MDP
Verbalized Action Masking (VAM)
Iterative action-space pruning
Verifier-based rewards and target actions
Chess implementation details
State, actions, and prompt interface
...and 43 more sections

Figures (8)

Figure 1: Overview of Verbalized Action Masking (VAM) with iterative action-space pruning. Given a state and a provided set of allowed actions, we prompt the LLM and sample rollout groups for Group Relative Policy Optimization (GRPO). Each sampled output is parsed into an action and then evaluated by a verifier to produce reward feedback. If the target condition is not met in the current round (for example, the target action is not sampled), we mask the distinct valid sampled actions out of the allowed action set and resample under the updated prompt. This iterative pruning procedure increases within-state action coverage and yields more informative grouped on-policy updates.
Figure 2: Prompt interface for VAM. Prompt inputs include the state description, format specification, and allowed action list; placeholders indicate where each appears.
Figure 3: Chess prompt with verbalized mask. Prompt inputs include the original prompt, FEN, legal moves, and allowed actions; placeholders indicate where each appears.
Figure 4: Performance on held-out chess puzzles. Pass@1 accuracy for selecting the dataset-provided solution move on the 10,000-position test set. The left panel compares fixed-dataset GRPO baselines under three engine-derived reward signals (expected-score proxy $\mu_{\text{exp}}$, win-rate $\mu_{\text{win}}$, and rank-based $\mu_{\text{rank}}$; Eqs. \ref{eq:mu-exp} and \ref{eq:mu-rank}). The right panel compares VAM with iterative action-space pruning against GRPO under the same fixed dataset and matched rollout budget for Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct. The horizontal axis counts GRPO gradient updates, and each point evaluates the current policy with fixed decoding settings over the full test set. Dashed horizontal lines indicate rejection-sampled SFT baselines.
Figure 5: Move quality in full games against Stockfish. Average centipawn loss (ACPL; lower is better) during evaluation against fixed Stockfish opponents at depth 1 and depth 5, shown for Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct. We report overall ACPL as well as per-opponent ACPL trajectories over training. VAM is trained with engine-play position generation and iterative action-space pruning, while the GRPO baseline uses the same verifier and prompt interface but omits pruning. Dashed horizontal lines indicate rejection-sampled SFT baselines. VAM consistently reaches lower ACPL across opponents and model sizes, indicating stronger move quality under full-game play.
...and 3 more figures
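
Figure 5 reports average centipawn loss (ACPL). As a rough illustration of the metric, the sketch below uses the common definition: for each move the model plays, take the engine evaluation of the best available move minus the evaluation of the played move, clip at zero, and average over moves. The function name and the convention that both evaluations are in centipawns from the mover's perspective are assumptions; the paper's exact computation (for example, analysis depth or clipping) is not given in this section.

```python
from typing import Sequence


def average_centipawn_loss(
    best_evals_cp: Sequence[float],
    played_evals_cp: Sequence[float],
) -> float:
    """Mean per-move centipawn loss; lower is better.

    Both inputs are engine evaluations in centipawns from the mover's
    perspective (an assumed convention), one entry per move played.
    """
    if len(best_evals_cp) != len(played_evals_cp):
        raise ValueError("evaluation lists must have the same length")
    if not best_evals_cp:
        raise ValueError("at least one move is required")
    # Clip at zero so a played move that scores above the reference
    # evaluation (possible with shallow search) does not count as negative loss.
    losses = [max(0.0, b - p) for b, p in zip(best_evals_cp, played_evals_cp)]
    return sum(losses) / len(losses)
```

For example, reference evaluations of 30, 50, -20, and 10 against played evaluations of 30, 10, -80, and 10 give per-move losses of 0, 40, 60, and 0, so the ACPL is 25.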
