Context-Former: Stitching via Latent Conditioned Sequence Modeling
Ziqi Zhang, Jingzehua Xu, Jinxin Liu, Zifeng Zhuang, Donglin Wang, Miao Liu, Shuai Zhang
TL;DR
ContextFormer introduces a latent HI-based stitching mechanism for decision making, endowing transformers with divergent sequential expert matching. It learns a latent contextual embedding z^* from a limited set of expert trajectories and optimizes a supervised policy loss, which lets it stitch sub-optimal trajectory fragments in the latent space. This avoids the conservatism of offline RL while improving generalization. The theoretical analysis connects the HI-based objective to the expert distribution and shows how z^* aligns with expert HI in expert-dominant regions of trajectory space. Empirically, ContextFormer achieves competitive imitation learning (IL) performance and outperforms several Decision Transformer (DT) variants trained on identical datasets, with strong results on maze2d stitching tasks and informative ablations on demonstration quantity, diversity, and quality.
Abstract
Offline reinforcement learning (RL) algorithms can learn policies that improve on the behavior policy by stitching sub-optimal trajectories into better ones. Meanwhile, Decision Transformer (DT) casts RL as sequence modeling and shows competitive performance on offline RL benchmarks. However, recent studies demonstrate that DT lacks stitching capability, so endowing DT with this capability is vital to further improving its performance. To this end, we frame trajectory stitching as expert matching and introduce ContextFormer, which integrates contextual information-based imitation learning (IL) with sequence modeling to stitch sub-optimal trajectory fragments by emulating the representations of a limited number of expert trajectories. We validate our approach from two perspectives: 1) We conduct extensive experiments on D4RL benchmarks under IL settings, and the results demonstrate that ContextFormer achieves competitive performance in multiple IL settings. 2) More importantly, we compare ContextFormer with various competitive DT variants trained on identical datasets. In this comparison, ContextFormer outperforms all other variants.
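The idea of conditioning a supervised policy on a latent embedding, and then swapping in an expert-derived embedding at decision time, can be illustrated with a toy sketch. Everything below is an assumption made for illustration rather than the paper's method: the mean-pooling linear encoder (`embed`), the linear policy fitted by least squares, the random synthetic trajectories, and in particular the use of the average expert embedding as a stand-in for the learned z^* (written `z_star`).

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, LATENT_DIM, HORIZON = 4, 2, 3, 10


def embed(states, enc_w):
    """Hypothetical encoder: mean-pool the states, then map linearly to the latent space."""
    return states.mean(axis=0) @ enc_w


def policy_inputs(states, z):
    """Concatenate each state with the same latent context vector z."""
    z_tiled = np.tile(z, (states.shape[0], 1))
    return np.concatenate([states, z_tiled], axis=1)


def make_traj():
    """Synthetic (states, actions) pair; stands in for a real offline trajectory."""
    states = rng.normal(size=(HORIZON, STATE_DIM))
    actions = rng.normal(size=(HORIZON, ACTION_DIM))
    return states, actions


enc_w = rng.normal(size=(STATE_DIM, LATENT_DIM))  # fixed random encoder (assumption)
suboptimal = [make_traj() for _ in range(20)]
experts = [make_traj() for _ in range(3)]

# Supervised policy loss: regress actions on [state; embedding of the source trajectory].
X = np.concatenate([policy_inputs(s, embed(s, enc_w)) for s, _ in suboptimal])
Y = np.concatenate([a for _, a in suboptimal])
pol_w, *_ = np.linalg.lstsq(X, Y, rcond=None)
train_loss = float(np.mean((X @ pol_w - Y) ** 2))

# Stand-in for z^*: the average embedding of the few expert trajectories.
z_star = np.mean([embed(s, enc_w) for s, _ in experts], axis=0)

# At decision time, condition the same policy on z_star instead of a per-trajectory embedding.
query_states = rng.normal(size=(5, STATE_DIM))
actions_at_z_star = policy_inputs(query_states, z_star) @ pol_w
```

In this sketch the policy is trained only on sub-optimal data, and the expert trajectories influence behavior solely through the conditioning vector. A transformer policy and a learned embedding objective would replace the linear pieces in a realistic setting.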
