
Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting

Wanpeng Zhang, Hao Luo, Sipeng Zheng, Yicheng Feng, Haiweng Xu, Ziheng Xi, Chaoyi Xu, Haoqi Yuan, Zongqing Lu

Abstract

Offline post-training adapts a pretrained robot policy to a target dataset by supervised regression on recorded actions. In practice, robot datasets are heterogeneous: they mix embodiments, camera setups, and demonstrations of varying quality, so many trajectories reflect recovery behavior, inconsistent operator skill, or weakly informative supervision. Uniform post-training gives equal credit to all samples and can therefore average over conflicting or low-attribution data. We propose Posterior-Transition Reweighting (PTR), a reward-free and conservative post-training method that decides how much each training sample should influence the supervised update. For each sample, PTR encodes the observed post-action consequence as a latent target, inserts it into a candidate pool of mismatched targets, and uses a separate transition scorer to estimate a softmax identification posterior over target indices. The posterior-to-uniform ratio defines the PTR score, which is converted into a clipped-and-mixed weight and applied to the original action objective through self-normalized weighted regression. This construction requires no tractable policy likelihood and is compatible with both diffusion and flow-matching action heads. Rather than uniformly trusting all recorded supervision, PTR reallocates credit according to how attributable each sample's post-action consequence is under the current representation, improving conservative offline adaptation to heterogeneous robot data.
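
The weighting pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact clipping bounds, mixing rule, or scorer interface, so everything named here is an assumption made for illustration: the function names, the `clip_min`, `clip_max`, and `mix` parameters, the convex mixing with a uniform weight of 1, and the convention that the scorer returns one scalar score per candidate target.

```python
import math


def identification_posterior(scores, matched_index):
    """Softmax probability assigned to the matched target in the candidate pool."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[matched_index] / sum(exps)


def ptr_weight(scores, matched_index, clip_min=0.5, clip_max=2.0, mix=0.5):
    """Clipped-and-mixed weight from the posterior-to-uniform ratio.

    Assumed form (not taken from the paper): clip the ratio to
    [clip_min, clip_max], then mix it with the uniform weight 1.
    """
    posterior = identification_posterior(scores, matched_index)
    ratio = posterior * len(scores)  # posterior divided by uniform 1 / pool size
    clipped = min(max(ratio, clip_min), clip_max)
    return (1.0 - mix) * 1.0 + mix * clipped


def self_normalized_loss(losses, weights):
    """Weighted mean of per-sample action losses, normalized by the weight sum."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Three hypothetical samples, each with a pool of four candidates (matched at index 0).
pools = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0], [-10.0, 0.0, 0.0, 0.0]]
weights = [ptr_weight(p, matched_index=0) for p in pools]
loss = self_normalized_loss([1.0, 1.0, 3.0], weights)
```

Under these assumptions, a sample whose matched target is indistinguishable from the mismatched ones keeps weight 1, a clearly identifiable one is upweighted up to the clip ceiling, and a poorly identifiable one is downweighted only to the clip floor. Because the weights multiply an arbitrary per-sample loss, the sketch needs no policy likelihood.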

Paper Structure

This paper contains 23 sections, 48 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of PTR. Left: the standard policy stack (backbone + action expert) is augmented with a lightweight scorer and a BeliefTokenizer. Right: for each training chunk, the scorer identifies the matched post-action target among mismatched candidates; the resulting identification posterior is converted into a conservative weight that rescales the supervised action loss. No reward labels or policy likelihoods are needed.
  • Figure 2: Robustness under corrupted training data (success rate %). Colored badges show PTR minus SFT deltas. PTR maintains higher performance across all corruption types.
  • Figure 3: Real-robot platforms used for evaluation. (a) Unitree G1 with LinkerHand O6 dexterous hands. (b) PND Adam-U with bimanual dexterous manipulation and a movable head. (c) FR3 single-arm with Inspire dexterous hand.
  • Figure 4: Representative real-robot tasks across three platforms and four capability suites.
  • Figure 5: Cross-embodiment task correspondence. Different robot platforms (Adam-U & FR3) execute semantically similar manipulation tasks, illustrating the shared post-action structure that enables PTR to selectively transfer useful knowledge.
  • ...and 6 more figures

Theorems & Definitions (3)

  • proof
  • proof: Proof of Proposition \ref{prop:kl_lens_main} under the bounded-ratio regularity condition
  • proof: Proof of Proposition \ref{prop:mixture_main}