
UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach

TL;DR

This work introduces UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs in order to optimize pass@k correctness. It proposes a novel reward, implemented within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages each trajectory to be specific to its latent $z$.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs for optimizing pass@k correctness. We propose a novel reward that we implement within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages each trajectory to be specific to its latent $z$. Experiments on GSM8K with three open-weight models, Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
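
For readers unfamiliar with the metric, pass@k is the probability that at least one of k attempts on a problem is correct. The sketch below uses the combinatorial estimator that is standard in code-generation evaluation; whether UpSkill's evaluation uses this exact estimator is an assumption, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which are correct.

    Assumption: this is the common combinatorial estimator, not
    necessarily the one used in the paper's evaluation.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fraction of size-k subsets of the n attempts that contain no
    # correct attempt. math.comb returns 0 when k > n - c, so the
    # estimate is 1.0 whenever every size-k subset must include a hit.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `k=1` the estimate reduces to the plain fraction of correct attempts, `c / n`, which is why pass@1 measures single-attempt accuracy.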

Paper Structure

This paper contains 58 sections, 2 theorems, 40 equations, 11 figures, 10 tables, 1 algorithm.

Key Result

Lemma 1

Let $\texttt{pass@k}_B$ be the pass@k score of the base model on prompt $x$ and $\texttt{pass@k}_M$ be the pass@k score of the mixture model on prompt $x$. Under the above assumptions, we show that: where $C_1$ depends on $k$ and $C_2$ depends on $k$ and $\max_z a_z$.
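
The bound itself is not reproduced in this listing, so the following is only a numeric illustration of why a mixture of specialized skills can raise pass@k. It rests on assumptions that are not taken from the lemma: $a_z$ is read as the per-attempt success probability of skill $z$, the base model succeeds on every attempt with the average of those probabilities, and the mixture makes exactly one attempt per skill. Under these assumptions the AM-GM inequality guarantees that the mixture's pass@k is at least the base model's.

```python
from math import prod
from typing import Sequence


def base_pass_at_k(a: Sequence[float]) -> float:
    """pass@k of a hypothetical base model making k = len(a) independent
    attempts, each succeeding with the mean of the per-skill rates."""
    k = len(a)
    mean_rate = sum(a) / k
    return 1.0 - (1.0 - mean_rate) ** k


def mixture_pass_at_k(a: Sequence[float]) -> float:
    """pass@k of a hypothetical mixture making one attempt per skill z,
    where attempt z succeeds independently with probability a[z]."""
    return 1.0 - prod(1.0 - a_z for a_z in a)


# Illustrative (made-up) rates: one skill solves the prompt, two do not.
specialized = [0.9, 0.1, 0.1]
# Same average success rate, but no specialization across skills.
uniform = [sum(specialized) / 3] * 3
```

In this toy setting `mixture_pass_at_k(specialized)` exceeds `base_pass_at_k(specialized)`, while the two coincide for `uniform`: the gain comes entirely from the skills differing in what they solve.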

Figures (11)

  • Figure 1: UpSkill improves mean multi-attempt accuracy without hurting single-attempt accuracy on GSM8K for the Qwen 2.5-7B model (see Sec. \ref{sec:gsm}).
  • Figure 2: UpSkill is an unsupervised method for training LLMs to produce diverse responses. After training, different latent vectors $z$ (blue boxes above) correspond to different response strategies. Because of space constraints, the figure shows summarized responses from UpSkill; we report the full responses in \ref{app:generation_figure}.
  • Figure 3: Example illustration of how the MISL reward improves pass@k performance. Before MISL (left), the trajectory distribution is independent of the latents $z$, so the conditional entropy is close to the marginal. MISL training prevents distribution collapse due to pass@1 training (middle). Adding the token-level MI reward (right) yields well-separated clusters indexed by $z$, reducing conditional entropy while preserving high marginal entropy. At inference, fixing different $z$ values produces consistent and diverse solution strategies.
  • Figure 4: Arithmetic environment results. Training curves show that under GRPO alone (blue), pass@1 and pass@5 converge together, indicating that multiple attempts provide little benefit. With MISL (orange; $N{=}5$), pass@5 improves substantially while pass@1 remains modest, demonstrating that different latents yield complementary solutions. Operator distributions further highlight this effect: without MISL, they are nearly identical across $z$, reflecting a lack of specialization, whereas with MISL, distinct latents focus on different operators, producing diverse strategies that drive multi-attempt gains.
  • Figure 5: Performance on $500$ held-out problems with $N{=}5$ strategies. We observe gains on all metrics for the Qwen model. Base refers to the model before GRPO training, Without MI refers to after GRPO training without token MI, and With MI refers to training with correctness rewards and token MI. We test multiple configurations for With MI and Without MI and plot all successful runs as multiple bars, as elaborated in Appendix \ref{app:gsm8k}. For entries with multiple bars, the labeled value is the maximum.
  • ...and 6 more figures
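
The three regimes described for Figure 3 can be made concrete with the identity $I(\tau; z) = H(\tau) - H(\tau \mid z)$. The sketch below is a toy illustration, not the paper's token-level reward: it assumes a uniform prior over latents and replaces trajectories with a small discrete set of "strategies", and all names and distributions are hypothetical.

```python
from math import log
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(q * log(q) for q in p if q > 0)


def mutual_information(cond: Sequence[Sequence[float]]) -> float:
    """I(strategy; z) = H(strategy) - H(strategy | z).

    cond[z][s] is p(strategy s | latent z); z is assumed uniform.
    """
    n = len(cond)
    num_strategies = len(cond[0])
    # Marginal over strategies: average the per-latent distributions.
    marginal = [sum(row[s] for row in cond) / n for s in range(num_strategies)]
    h_marginal = entropy(marginal)
    h_conditional = sum(entropy(row) for row in cond) / n
    return h_marginal - h_conditional


# Strategy ignores z: conditional entropy equals marginal entropy, MI is zero.
independent = [[1 / 3, 1 / 3, 1 / 3]] * 3
# Every latent collapses onto one strategy: both entropies vanish, MI is zero.
collapsed = [[1.0, 0.0, 0.0]] * 3
# Each latent owns one strategy: high marginal entropy, zero conditional
# entropy, so MI reaches its maximum of log(3) nats.
separated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Only `separated` has both high marginal entropy and low conditional entropy, which is the well-separated cluster pattern the caption attributes to the token-level MI reward.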

Theorems & Definitions (3)

  • Lemma 1
  • Definition
  • Lemma 2: pass@k Improvement for $k$-uniform Mixture Models, Full Statement