Fibration Policy Optimization

Chang Li, Tshihao Tsu, Yaren Zhang, Chao Xue, Xiaodong He

TL;DR

We derive the Aggregational Policy Censoring Objective (APC-Obj), the first exact unconstrained reformulation of sample-based TV-TRPO. The result establishes that clipping-based surrogate design and trust-region optimization are dual formulations of the same problem.

Abstract

Large language models are increasingly trained as heterogeneous systems spanning multiple domains, expert partitions, and agentic pipelines, yet prevalent proximal objectives operate at a single scale and lack a principled mechanism for coupling token-level, trajectory-level, and higher-level hierarchical stability control. To bridge this gap, we derive the Aggregational Policy Censoring Objective (APC-Obj), the first exact unconstrained reformulation of sample-based TV-TRPO, establishing that clipping-based surrogate design and trust-region optimization are dual formulations of the same problem. Building on this foundation, we develop Fiber Bundle Gating (FBG), an algebraic framework that organizes sampled RL data as a fiber bundle and decomposes ratio gating into a base-level gate on trajectory aggregates and a fiber-level gate on per-token residuals, with provable first-order agreement with the true RL objective near on-policy. From APC-Obj and FBG we derive Fibration Policy Optimization (FiberPO), a concrete objective whose Jacobian is block-diagonal over trajectories, reduces to the identity at on-policy, and provides a better update direction, thereby improving token efficiency. The compositional nature of the framework extends beyond the trajectory-token case: fibrations compose algebraically into a Fibration Gating Hierarchy (FGH) that scales the same gating mechanism to arbitrary hierarchical depth without new primitives, as demonstrated by FiberPO-Domain, a four-level instantiation with independent trust-region budgets at the domain, prompt-group, trajectory, and token levels. Together, these results connect trust-region theory, a compositional algebraic structure, and practical multi-scale stability control into a unified framework for LLM policy optimization.

Paper Structure

This paper contains 131 sections, 13 theorems, 183 equations, 5 figures.

Key Result

Theorem 2.1

When $\gamma = 1$, both the TV-based and KL-based TRPO trust regions collapse to the reference policy (Definition def:trust_region). Consequently, the only policy update permitted by TV-based TRPO's trust region at $\gamma=1$ is the trivial one: $\pi_{\theta_{new}}=\pi_{\theta_{old}}$.

Figures (5)

  • Figure 1: Fiber bundle model for sampled RLHF data. The base space $B=\mathrm{Tj}^{\theta_{\rm old}}\!\times\!\{-1,+1\}$ consists of two separate lines indexed by the sign label $l\in\{-1,+1\}$, encoding trajectory-level global information as densities on $B$. Each fiber $\pi_E^{-1}[\{(\tau,l)\}]$ collects the per-token data points belonging to trajectory $\tau$ with sign $l$, encoding local information as densities on the total space $E$. The maps $\mathcal{F}$ and $\mathcal{R}$ convert between policy ratios and densities.
  • Figure 2: Base weight visualization. Panel (a) shows the aggregate gate $g^{\rm agg}$ (Eq. \ref{eq:gagg_main}) with three regimes: pass-through ($|x| \leq C$, slope $1$), rollback ($C < |x| < C^*$, slope $-k$), and zeroed ($|x| \geq C^*$, output $0$). As $k = T_\tau$ increases, the rollback regime narrows (width $C/T_\tau$) and $g^{\rm agg}$ approaches a hard clip at $\pm C$. Panel (b) shows the base weight $\log w_\tau^{\rm base}$ (Eq. \ref{eq:base_weight}) in $(\log s^+, \log s^-)$-space with asymmetric thresholds $C^+ = 0.20$, $C^- = 0.1$ ($k = 2$). Dashed lines mark the budget boundaries $C^\pm$ (onset of rollback); dotted lines mark the full-gating thresholds $C^{*\pm}$ (onset of zeroing). The five global regimes (G-I through G-III) follow a non-monotonic pattern: $|\log w|$ rises through the rollback onset (G-II,r), peaks when one channel is fully gated (G-II), declines under mutual rollback (G-III,r), and collapses to zero when both channels are fully gated (G-III, $w^{\rm base}_\tau = 1$).
  • Figure 3: Regime map on the probability simplex of all possible configurations of a trajectory's policy ratios, subject to the probabilistic constraint $\frac{1}{T_{\rm trajectory}}\sum_{t=0}^{T_{\rm trajectory}-1} r_t = 1$, with $T{=}3$, $\varepsilon{=}0.025$, $C^+{=}0.15$, $C^-{=}0.09$, $K{=}3$, $C^{*\pm}{=}(1{+}1/K)\,C^{\pm}$. (a) Local branch (zoomed $\pm 0.18$ around centroid): L-I (no clip, central polytope inside teal boundary), L-II (partial clip, between teal solid and purple solid boundaries), and L-III (full clip, outside purple boundary). Red and blue dashed contours show the rollback thresholds $N{=}C^-$ and $P{=}C^+$. (b) Global branch (full simplex): five concentric regimes: G-I (inside red solid $N{=}C^-$), G-II,r (rollback transition, between red solid $N{=}C^-$ and red dotted $N{=}C^{*-}$), G-II (between red dotted $N{=}C^{*-}$ and blue solid $P{=}C^+$), G-III,r (rollback transition, between blue solid $P{=}C^+$ and blue dotted $P{=}C^{*+}$), and G-III (outside blue dotted $P{=}C^{*+}$).
  • Figure 4: Per-token and total FiberPO objective under parameterized drift $r_i = 1 + t\hat{\Delta}_i$ ($T{=}10$, $k{=}32$, $\epsilon{=}0.04$, $C^+{=}0.12$, $C^-{=}0.05$), with $\operatorname{sign}(\hat{\Delta}_i) = \operatorname{sign}(\hat{A}_i)$, reflecting the typical policy update in which tokens with positive advantage increase their probability ($\hat{\Delta}_i > 0$) and tokens with negative advantage decrease their probability ($\hat{\Delta}_i < 0$). GRPO in the graph uses the same $\epsilon$-clip as FiberPO. (a) Trajectory in $(\log s^+, \log s^-)$-space overlaid on the $\log w$ heatmap; as drift increases, the trajectory moves through the global regimes of Figure \ref{fig:fiber_weight}(b). (b) Per-token fiber correction at $t{=}1$: fiber residual (red) and base-weight contribution (orange) stack to form the full gated log-ratio. (c) Per-token objective for two representative tokens. (d) Total objective: FiberPO's $g^{\rm agg}$ and $\operatorname{logclip}$ saturate, causing the objective to return to baseline. Vertical markers: $\epsilon$-clip transitions (gray, L-I$\to$L-II), $C^-$ rollback (red, G-II,r onset), $C^+$ rollback (blue, G-II,r onset).
  • Figure 5: Derivation roadmap from APC-Obj to FiberPO. The APC-Obj objective uses a single clipping with cross-token interaction to enforce a per-state TV budget; the two scales of stability information (trajectory-level aggregate ratios $s_\tau^\pm$ and per-token ratios $r_{s,a}$) are implicitly entangled within this clipping. Through four successive relaxations, the key step (Derivation IV) decomposes the single clipping to make the two scales explicit, yielding separate global and local gates that fit naturally into the FBG form. The resulting FiberPO inherits trust-region stability from APC-Obj and structural guarantees (first-order agreement, global-local decoupling, information exchange, extensibility) from the FBG framework.
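
The three regimes of the aggregate gate described in the Figure 2 caption can be illustrated with a short sketch. This is a reconstruction from the caption alone, not the paper's Eq. for $g^{\rm agg}$: it assumes the gate is odd-symmetric and continuous, and that the full-gating threshold is $C^* = (1 + 1/k)\,C$ (the relation stated in the Figure 3 caption), so that the rollback segment of slope $-k$ has width $C/k$. The function names `g_agg` and `full_gating_threshold` are hypothetical.

```python
import math


def full_gating_threshold(C: float, k: float) -> float:
    """Assumed relation C* = (1 + 1/k) * C, taken from the Figure 3 caption."""
    return (1.0 + 1.0 / k) * C


def g_agg(x: float, C: float, k: float) -> float:
    """Piecewise-linear aggregate gate, reconstructed from the Figure 2 caption.

    Assumptions (not confirmed by the paper's equation): odd symmetry in x
    and continuity at |x| = C and |x| = C*.
    """
    a = abs(x)
    if a <= C:
        # Pass-through regime: slope 1.
        return x
    if a >= full_gating_threshold(C, k):
        # Zeroed regime: output 0.
        return 0.0
    # Rollback regime: magnitude falls from C with slope -k, reaching 0 at C*.
    magnitude = max(0.0, C - k * (a - C))
    return math.copysign(magnitude, x)


if __name__ == "__main__":
    # Values from the Figure 2 caption: C+ = 0.20, k = 2.
    for x in (0.10, 0.20, 0.25, 0.30, 0.40, -0.25):
        print(f"x = {x:+.2f}  ->  g_agg = {g_agg(x, C=0.20, k=2):+.4f}")
```

Under these assumptions, increasing `k` shrinks the rollback interval `(C, C*)` toward zero width, which matches the caption's remark that the gate approaches a hard clip at $\pm C$ as $k = T_\tau$ grows.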

Theorems & Definitions (62)

  • Theorem 2.1: TRPO vanishing theorem
  • Definition 3.1: Ratio Gating Formalism (RGF)
  • Definition 3.2: Aggregational Policy Censoring Objective, APC-Obj policy iteration
  • Theorem 3.3: Sample-based TV-TRPO and APC-Obj equivalence; restatement of Theorem \ref{thm:apc_obj_trpo_equiv}
  • Remark 3.4: APC-Obj as the structural bridge for $\delta$-relaxation
  • Claim 3.5
  • Remark 3.6
  • Remark 4.1
  • Remark 4.2: Choice of base space and extensibility
  • Definition 4.3: Fiber Bundle Gating
  • ...and 52 more