PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Ruishuo Chen, Yu Chen, Zhuoran Li, Longbo Huang

Abstract

Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.

Paper Structure (37 sections, 2 theorems, 17 equations, 8 figures, 2 tables)

Key Result

Proposition 3.1

For a trajectory $\tau=(s_0, s_1, \dots, s_T)$, define the Trajectory Balance objective as: If the learned partition function $Z_\phi$ is optimized to its equilibrium, the expected gradient of $\mathcal{L}_{\text{TB}}$ with respect to the forward policy parameters $\theta$ satisfies:
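
For orientation, the Trajectory Balance objective in its standard form from the GFlowNet literature is sketched below, together with a short derivation of its on-policy expected gradient. The symbols $P_F$ (forward policy), $P_B$ (backward policy), and $R$ (unnormalized target density) are assumptions of this sketch and may differ from the paper's notation. This is the well-known link between TB and reverse-KL minimization, not necessarily the exact statement of Proposition 3.1.

$$
\mathcal{L}_{\text{TB}}(\tau) = \left( \log Z_\phi + \sum_{t=0}^{T-1} \log P_F(s_{t+1} \mid s_t; \theta) - \log R(s_T) - \sum_{t=0}^{T-1} \log P_B(s_t \mid s_{t+1}) \right)^2
$$

In autoregressive generation every state has a unique parent, so $P_B \equiv 1$ and the last sum vanishes. Writing $\delta(\tau) = \log Z_\phi + \log P_F(\tau; \theta) - \log R(s_T)$, treating sampled trajectories as fixed (no gradient through sampling), and using $\mathbb{E}_{\tau \sim P_F}[\nabla_\theta \log P_F(\tau; \theta)] = 0$:

$$
\mathbb{E}_{\tau \sim P_F}\left[\nabla_\theta \mathcal{L}_{\text{TB}}(\tau)\right]
= 2\, \mathbb{E}_{\tau \sim P_F}\left[\delta(\tau)\, \nabla_\theta \log P_F(\tau; \theta)\right]
= 2\, \nabla_\theta D_{\mathrm{KL}}\!\left(P_F(\cdot\,; \theta) \,\middle\|\, R / Z\right),
\qquad Z = \sum_{\tau} R(s_T).
$$

Under these assumptions $\log Z_\phi$ does not change the expected gradient; it acts as a baseline that affects only the variance of the estimate.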

Figures (8)

  • Figure 1: Illustration of the PowerFlow framework for directional capability elicitation. By matching the length-aware $\alpha$-power distribution, PowerFlow can either sharpen the distribution ($\alpha > 1$) to enhance logical reasoning or flatten it ($\alpha < 1$) to restore latent creativity. The right panels illustrate significant performance gains and a clear Pareto improvement over existing baselines.
  • Figure 2: The PowerFlow framework. During training (top), the policy $\pi_\theta$ and $\log Z'_\phi$ module are optimized via the LA-TB objective to match the $\alpha$-power distribution of the base model while neutralizing length bias. The control knob $\alpha$ enables directional elicitation: sharpening ($\alpha > 1$) for reasoning or flattening ($\alpha < 1$) for creativity. The inference pipeline (bottom) remains standard.
  • Figure 3: Stability analysis of distribution matching strategies. Matching the trajectory-level $\alpha$-power distribution via standard TB or RL objectives (-traj) leads to rapid length collapse. Token-level normalization (-token) initially improves performance but eventually decays due to the exploitation of repetitive tokens. PowerFlow maintains both stable response length and superior reasoning accuracy (pass@1 on MATH) throughout training.
  • Figure 4: Comparison of solution diversity scores on AIME24/25. PowerFlow maintains superior strategy variety.
  • Figure 5: Quality vs. Semantic Diversity on creative writing tasks. The shaded region indicates the area of Pareto improvement relative to the Instruct baseline. PowerFlow (stars) consistently shifts the Pareto frontier outward across all model scales.
  • ...and 3 more figures
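
The sharpening and flattening effect of $\alpha$ described in the captions of Figures 1 and 2 can be illustrated on a toy categorical distribution. The sketch below is a minimal illustration, not the paper's method: the helper names `power_distribution` and `entropy` and the three-outcome base distribution are hypothetical, and the length-aware correction used by PowerFlow is not modeled here.

```python
import math


def power_distribution(probs, alpha):
    """Return the alpha-power distribution p_i**alpha / sum_j p_j**alpha.

    Hypothetical helper for illustration; it operates on a plain
    categorical distribution rather than on LLM trajectories.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    weights = [p ** alpha for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats; lower entropy means a sharper distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy base distribution over three outcomes (assumed values).
base = [0.6, 0.3, 0.1]

sharp = power_distribution(base, 2.0)  # alpha > 1 concentrates mass on the mode
flat = power_distribution(base, 0.5)   # alpha < 1 spreads mass toward the tail

print("base ", [round(p, 3) for p in base], round(entropy(base), 3))
print("sharp", [round(p, 3) for p in sharp], round(entropy(sharp), 3))
print("flat ", [round(p, 3) for p in flat], round(entropy(flat), 3))
```

With $\alpha = 1$ the base distribution is returned unchanged, so $\alpha$ acts as a single knob that moves the target between a more peaked and a more uniform version of the same distribution.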

Theorems & Definitions (3)

  • Proposition 3.1: zimmermann2023a
  • Theorem D.1: Asymptotic Convergence to Dirac Distribution
  • proof