SemTra: A Semantic Skill Translator for Cross-Domain Zero-Shot Policy Adaptation
Sangwoo Shin, Minjong Yoo, Jeongwoo Lee, Honguk Woo
TL;DR
SemTra addresses cross-domain zero-shot policy adaptation for long-horizon tasks with a two-phase framework. The first phase translates interleaved multi-modal prompts into semantic skill sequences, using multi-modal skill encoders and a PLM-based semantic skill decoder. The second phase instantiates these skills as executable actions through domain-context and online-context encoders coupled with a behavior decoder. The approach transfers across robotic manipulation and autonomous driving environments (e.g., Franka Kitchen, Meta-World, RLBench, CARLA), outperforms the baselines in both task-level and skill-level evaluations, and shows semantic alignment in the multi-modal space. These results suggest that semantic reasoning combined with context-aware skill instantiation could enable zero-shot deployment of complex policies in safety-critical settings. The work has practical implications for cognitive robots and autonomous systems that must interpret abstract instructions and adapt to varied configurations without extensive domain-specific fine-tuning.
Abstract
This work explores the zero-shot adaptation capability of semantic skills, which are semantically interpretable patterns of expert behavior, in cross-domain settings, where a user input given as interleaved multi-modal snippets can prompt a new long-horizon task for a different domain. For these cross-domain settings, we present SemTra, a semantic skill translator framework that uses a set of multi-modal models to extract skills from the snippets and leverages the reasoning capabilities of a pretrained language model to adapt the extracted skills to the target domain. The framework employs a two-level hierarchy for adaptation: task adaptation and skill adaptation. During task adaptation, sequence-to-sequence translation by the language model transforms the extracted skills into a semantic skill sequence tailored to the cross-domain context. Skill adaptation then optimizes each semantic skill for the target domain context through parametric instantiations, which are facilitated by language prompting and contrastive learning-based context inference. This hierarchical adaptation allows the framework not only to infer a complex task specification in one shot from the interleaved multi-modal snippets, but also to adapt it to new domains in a zero-shot manner. We evaluate our framework in the Meta-World, Franka Kitchen, RLBench, and CARLA environments. The results demonstrate the framework's superiority in performing long-horizon tasks and adapting to different domains, showing its broad applicability in practical use cases such as cognitive robots interpreting abstract instructions and autonomous vehicles operating under varied configurations.
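
The two-level hierarchy described in the abstract can be illustrated with a toy sketch. This is not SemTra's implementation: every name here (Snippet, extract_skills, task_adaptation, skill_adaptation, run_pipeline, and the toy_* stand-ins) is hypothetical, and the learned components (multi-modal encoders, the pretrained language model, the context encoders) are replaced by trivial string functions so that only the data flow is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Snippet:
    """One piece of an interleaved multi-modal prompt (hypothetical type)."""
    modality: str  # e.g. "text" or "video"
    content: str   # stand-in for the raw data of that modality


SkillEncoder = Callable[[str], List[str]]


def extract_skills(snippets: List[Snippet],
                   encoders: Dict[str, SkillEncoder]) -> List[str]:
    """Map each snippet to semantic skill labels with its modality's encoder."""
    skills: List[str] = []
    for snippet in snippets:
        skills.extend(encoders[snippet.modality](snippet.content))
    return skills


def task_adaptation(skills: List[str],
                    translate: Callable[[List[str]], List[str]]) -> List[str]:
    """Task level: translate extracted skills into a target skill sequence.

    In the paper this role is played by a pretrained language model;
    here `translate` is an arbitrary callable supplied by the caller.
    """
    return translate(skills)


def skill_adaptation(skill_sequence: List[str],
                     domain_context: Dict[str, str]) -> List[Dict[str, str]]:
    """Skill level: attach domain-context parameters to every skill."""
    return [{"skill": skill, **domain_context} for skill in skill_sequence]


def run_pipeline(snippets: List[Snippet],
                 encoders: Dict[str, SkillEncoder],
                 translate: Callable[[List[str]], List[str]],
                 domain_context: Dict[str, str]) -> List[Dict[str, str]]:
    """Chain extraction, task adaptation, and skill adaptation."""
    extracted = extract_skills(snippets, encoders)
    sequence = task_adaptation(extracted, translate)
    return skill_adaptation(sequence, domain_context)


# Toy stand-ins for the learned models (assumptions, for illustration only).
def toy_text_encoder(content: str) -> List[str]:
    """Treat a comma-separated instruction as a list of skill names."""
    return [part.strip() for part in content.split(",")]


def toy_video_encoder(content: str) -> List[str]:
    """Pretend the whole clip demonstrates a single skill."""
    return [content]


def toy_translate(skills: List[str]) -> List[str]:
    """Drop consecutive duplicate skills, keeping the original order."""
    return [s for i, s in enumerate(skills) if i == 0 or s != skills[i - 1]]
```

Calling run_pipeline with a text snippet and a video snippet, the two toy encoders, toy_translate, and a domain context such as {"speed": "slow"} yields one parameterized entry per skill in the translated sequence. The separation mirrors the abstract's split: the sequence-level step decides which skills to run, while the skill-level step decides how each one is parameterized for the target domain.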
