Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting

Wanpeng Zhang; Hao Luo; Sipeng Zheng; Yicheng Feng; Haiweng Xu; Ziheng Xi; Chaoyi Xu; Haoqi Yuan; Zongqing Lu

Abstract

Offline post-training adapts a pretrained robot policy to a target dataset by supervised regression on recorded actions. In practice, robot datasets are heterogeneous: they mix embodiments, camera setups, and demonstrations of varying quality, so many trajectories reflect recovery behavior, inconsistent operator skill, or weakly informative supervision. Uniform post-training gives equal credit to all samples and can therefore average over conflicting or low-attribution data. We propose Posterior-Transition Reweighting (PTR), a reward-free and conservative post-training method that decides how much each training sample should influence the supervised update. For each sample, PTR encodes the observed post-action consequence as a latent target, inserts it into a candidate pool of mismatched targets, and uses a separate transition scorer to estimate a softmax identification posterior over target indices. The posterior-to-uniform ratio defines the PTR score, which is converted into a clipped-and-mixed weight and applied to the original action objective through self-normalized weighted regression. This construction requires no tractable policy likelihood and is compatible with both diffusion and flow-matching action heads. Rather than uniformly trusting all recorded supervision, PTR reallocates credit according to how attributable each sample's post-action consequence is under the current representation, improving conservative offline adaptation to heterogeneous robot data.
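The weighting pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formulas, so the function names, the mixing coefficient `lam`, the clip bounds `w_min` and `w_max`, and the specific mixing form (a convex combination of a uniform weight of 1 and the clipped ratio) are all hypothetical. The scorer is represented only by its output logits over a candidate pool of size K.

```python
import math

def identification_posterior(logits, matched_index):
    """Softmax probability the scorer assigns to the matched target index."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[matched_index] / sum(exps)

def ptr_weight(logits, matched_index, lam=0.5, w_min=0.0, w_max=2.0):
    """Posterior-to-uniform ratio, clipped, then mixed with a uniform weight.

    Assumed form (not taken from the paper):
        w = (1 - lam) * 1 + lam * clip(K * posterior, w_min, w_max)
    """
    k = len(logits)
    ratio = identification_posterior(logits, matched_index) * k  # posterior / (1/K)
    clipped = min(max(ratio, w_min), w_max)
    return (1.0 - lam) * 1.0 + lam * clipped

def self_normalized_loss(per_sample_losses, weights):
    """Weighted mean of per-sample action losses, normalized by the weight sum."""
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / total

# Candidate pool of 4 targets; index 0 is the matched post-action target.
w_uninformative = ptr_weight([0.0, 0.0, 0.0, 0.0], matched_index=0)  # posterior equals uniform
w_attributable = ptr_weight([10.0, 0.0, 0.0, 0.0], matched_index=0)  # matched target stands out
w_conflicting = ptr_weight([-10.0, 0.0, 0.0, 0.0], matched_index=0)  # matched target looks wrong

loss = self_normalized_loss([1.0, 3.0], [w_attributable, w_conflicting])
```

Under these assumptions, a sample whose posterior equals the uniform baseline keeps a weight of 1, and mixing with `lam < 1` keeps every weight strictly positive, so no sample is discarded outright. Because the weights multiply only per-sample losses, the sketch never needs a policy likelihood, which matches the abstract's claim of compatibility with diffusion and flow-matching action heads.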

Paper Structure (23 sections, 48 equations, 11 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Preliminaries and Notation
Posterior-Transition Reweighting
BeliefTokenizer
Posterior transition score
Theoretical foundations
Conservative reweighting and the induced training distribution
Adaptive scale control
Practical pipeline and gradient routing
Experiments
Setup and baselines
Simulation benchmarks
Robustness under corrupted training data
Real-robot evaluation and cross-embodiment transfer
...and 8 more sections

Figures (11)

Figure 1: Overview of PTR. Left: the standard policy stack (backbone + action expert) is augmented with a lightweight scorer and a BeliefTokenizer. Right: for each training chunk, the scorer identifies the matched post-action target among mismatched candidates; the resulting identification posterior is converted into a conservative weight that rescales the supervised action loss. No reward labels or policy likelihoods are needed.
Figure 2: Robustness under corrupted training data (success rate, %). Colored badges show PTR minus SFT deltas. PTR maintains higher performance across all corruption types.
Figure 3: Real-robot platforms used for evaluation. (a) Unitree G1 with LinkerHand O6 dexterous hands. (b) PND Adam-U with bimanual dexterous manipulation and a movable head. (c) FR3 single-arm with Inspire dexterous hand.
Figure 4: Representative real-robot tasks across three platforms and four capability suites.
Figure 5: Cross-embodiment task correspondence. Different robot platforms (Adam-U & FR3) execute semantically similar manipulation tasks, illustrating the shared post-action structure that enables PTR to selectively transfer useful knowledge.
...and 6 more figures

Theorems & Definitions (3)

proof
proof : Proof of Proposition \ref{prop:kl_lens_main} under the bounded-ratio regularity condition
proof : Proof of Proposition \ref{prop:mixture_main}
