Data Fusion-Enhanced Decision Transformer for Stable Cross-Domain Generalization
Guojian Wang, Quinson Hon, Xuyang Chen, Lin Zhao
TL;DR
DFDT addresses cross-domain generalization for Decision Transformer (DT) policies by explicitly restoring token-level stitchability across source and target dynamics. It fuses target data with selectively trusted source fragments using a two-level filter that combines maximum mean discrepancy (MMD) and optimal transport (OT), replaces brittle return-to-go (RTG) tokens with advantage-conditioned tokens, and employs a Q-guided regularizer to smooth trajectory junctions. Theoretical bounds connect value and policy gaps to stitchability radii and estimation errors, while experiments across gravity, kinematic, and morphology shifts show higher returns and improved sequence stability over strong offline RL and sequence-model baselines. By operating in trajectory space with filtered data fusion and advantage-based conditioning, DFDT offers a principled approach to robust cross-domain transfer in offline reinforcement learning.
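To make the brittleness of RTG concrete, the following minimal sketch computes the return-to-go as the sum of future rewards and shows how the same per-step behavior yields different conditioning values once the horizon or the reward scale changes. The function name `returns_to_go` and the toy reward sequences are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def returns_to_go(rewards):
    """Undiscounted return-to-go: rtg[t] = sum of rewards[t:] (illustrative helper)."""
    rewards = np.asarray(rewards, dtype=float)
    # Reverse, take the running sum, and reverse back to get suffix sums.
    return np.cumsum(rewards[::-1])[::-1]

# Same per-step reward, different horizon: the initial RTG token differs.
short = returns_to_go([1.0, 1.0, 1.0])            # [3., 2., 1.]
long = returns_to_go([1.0, 1.0, 1.0, 1.0, 1.0])   # [5., 4., 3., 2., 1.]

# Same horizon, rewards rescaled by 2: every RTG token is rescaled too.
scaled = returns_to_go([2.0, 2.0, 2.0])           # [6., 4., 2.]
```

Because RTG values inherit the horizon and reward scale of the domain that produced them, tokens from source and target fragments need not be comparable when placed in one sequence, which is the situation that advantage-conditioned tokens are meant to avoid.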
Abstract
Cross-domain shifts present a significant challenge for Decision Transformer (DT) policies. Existing cross-domain policy adaptation methods typically rely on a single, simple filtering criterion, matching either state structure or action feasibility, to select source trajectory fragments and stitch them together. However, the selected fragments still have poor stitchability: state structures can misalign, the return-to-go (RTG) becomes incomparable when the reward or horizon changes, and actions may jump at trajectory junctions. As a result, RTG tokens lose continuity, which compromises DT's inference ability. To tackle these challenges, we propose the Data Fusion-Enhanced Decision Transformer (DFDT), a compact pipeline that restores stitchability. Specifically, DFDT fuses scarce target data with selectively trusted source fragments via a two-level data filter: a maximum mean discrepancy (MMD) mismatch for state-structure alignment and an optimal transport (OT) deviation for action feasibility. It then trains on a feasibility-weighted fusion distribution. Furthermore, DFDT replaces RTG tokens with advantage-conditioned tokens, which improves the semantic continuity of the token sequence. It also applies a $Q$-guided regularizer to suppress value and action jumps at trajectory junctions. Theoretically, we provide bounds that tie the state-value and policy performance gaps to the MMD-mismatch and OT-deviation measures, and show that the bounds tighten as these two measures shrink. Empirically, DFDT improves return and stability over strong offline RL and sequence-model baselines across gravity, kinematic, and morphology shifts on D4RL-style control tasks, and token-stitching and sequence-semantics stability analyses further corroborate these gains.
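As a rough illustration of the state-structure level of such a filter, the sketch below estimates the squared MMD between the states of a source fragment and a batch of target states using a Gaussian kernel, then maps it to a trust weight. The kernel choice, the bandwidth, the names `mmd2` and `trust_weight`, and the exponential weighting are all assumptions made for illustration; the abstract does not specify them, and the OT-based action-feasibility level is omitted here.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy

def trust_weight(source_states, target_states, temperature=0.1):
    """Hypothetical weighting: a smaller mismatch gives a weight closer to 1."""
    return float(np.exp(-mmd2(source_states, target_states) / temperature))

rng = np.random.default_rng(0)
target_states = rng.normal(0.0, 1.0, size=(200, 3))
aligned_fragment = rng.normal(0.0, 1.0, size=(200, 3))  # similar state structure
shifted_fragment = rng.normal(3.0, 1.0, size=(200, 3))  # misaligned state structure

w_aligned = trust_weight(aligned_fragment, target_states)
w_shifted = trust_weight(shifted_fragment, target_states)
```

Under these assumptions the aligned fragment receives a larger weight than the shifted one, so it would contribute more to a feasibility-weighted fusion distribution.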
