SemTra: A Semantic Skill Translator for Cross-Domain Zero-Shot Policy Adaptation

Sangwoo Shin, Minjong Yoo, Jeongwoo Lee, Honguk Woo

TL;DR

SemTra tackles cross-domain zero-shot policy adaptation for long-horizon tasks with a two-phase framework. The first phase translates interleaved multi-modal prompts into semantic skill sequences using multi-modal skill encoders and a semantic skill decoder based on a pretrained language model (PLM). The second phase instantiates these skills as executable actions through domain-context and online-context encoders coupled with a behavior decoder. The approach is evaluated on robotic manipulation and autonomous driving environments (Franka Kitchen, Meta-World, RLBench, CARLA), where it outperforms the baselines in both task-level and skill-level evaluations and shows meaningful semantic alignment in the multi-modal space. These results point to SemTra's potential for zero-shot deployment of complex policies in safety-critical settings through semantic reasoning and context-aware skill instantiation. The work has practical implications for cognitive robots and autonomous systems that must interpret abstract instructions and adapt to varied configurations without extensive domain-specific fine-tuning.

Abstract

This work explores the zero-shot adaptation capability of semantic skills, that is, semantically interpretable expert behavior patterns, in cross-domain settings where a user input given as interleaved multi-modal snippets can prompt a new long-horizon task for different domains. For these settings, we present SemTra, a semantic skill translator framework that uses a set of multi-modal models to extract skills from the snippets and leverages the reasoning capabilities of a pretrained language model to adapt the extracted skills to the target domain. The framework employs a two-level hierarchy for adaptation: task adaptation and skill adaptation. During task adaptation, sequence-to-sequence translation by the language model transforms the extracted skills into a semantic skill sequence tailored to the cross-domain contexts. Skill adaptation optimizes each semantic skill for the target domain context through parametric instantiations, which are facilitated by language prompting and contrastive learning-based context inference. This hierarchical adaptation enables the framework not only to infer a complex task specification in one shot from the interleaved multi-modal snippets, but also to adapt it to new domains in a zero-shot manner. We evaluate our framework in the Meta-World, Franka Kitchen, RLBench, and CARLA environments. The results demonstrate the framework's superiority in performing long-horizon tasks and adapting to different domains, showing its broad applicability in practical use cases such as cognitive robots interpreting abstract instructions and autonomous vehicles operating under varied configurations.
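
The two-level hierarchy described in the abstract can be pictured as a small data-flow sketch. Everything below is a hypothetical illustration, not the authors' implementation: the `Snippet` type, the function names, the `" then "` separator, and the `context:` cue convention are all assumptions made for this example, and the learned components (multi-modal encoders, PLM-based generator, context encoder, behavior decoder) are replaced by toy string functions. The skill boundary detector and the online context encoder are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Snippet:
    modality: str  # e.g. "text" or "video" (assumed labels)
    content: str   # raw text, or a stand-in identifier for a video clip


# --- Toy stand-ins for the learned components (all hypothetical) ---------

def text_encoder(content: str) -> str:
    # A text snippet is already language; just normalise it.
    return content.lower()


def video_encoder(content: str) -> str:
    # A real encoder would embed frames; a lookup table fakes the result.
    return {"clip_open_microwave": "open microwave"}.get(content, "unknown skill")


def sequence_generator(instruction: str) -> List[str]:
    # Stand-in for the seq-to-seq step: keep skills, drop contextual cues.
    parts = [p.strip() for p in instruction.split(" then ")]
    return [p for p in parts if not p.startswith("context:")]


def context_encoder(instruction: str, skills: List[str]) -> str:
    # Stand-in for domain-context inference: read the cue from the instruction.
    for part in instruction.split(" then "):
        if part.startswith("context:"):
            return part[len("context:"):].strip()
    return "default"


def behavior_decoder(skill: str, context: str, state: str) -> str:
    # Stand-in for the policy: return a readable action description.
    return f"{skill} [{context}] @ {state}"


# --- The two adaptation levels -------------------------------------------

def task_adaptation(
    prompt: List[Snippet],
    encoders: Dict[str, Callable[[str], str]],
) -> Tuple[str, List[str]]:
    """Level 1: snippets -> skill-level instruction -> semantic skill sequence."""
    instruction = " then ".join(encoders[s.modality](s.content) for s in prompt)
    return instruction, sequence_generator(instruction)


def skill_adaptation(instruction: str, skills: List[str]) -> List[Tuple[str, str]]:
    """Level 2: attach the inferred domain context to every semantic skill."""
    context = context_encoder(instruction, skills)
    return [(skill, context) for skill in skills]


def run(prompt: List[Snippet], state: str) -> List[str]:
    encoders = {"text": text_encoder, "video": video_encoder}
    instruction, skills = task_adaptation(prompt, encoders)
    executable = skill_adaptation(instruction, skills)
    return [behavior_decoder(skill, ctx, state) for skill, ctx in executable]


prompt = [
    Snippet("video", "clip_open_microwave"),
    Snippet("text", "Turn on the light"),
    Snippet("text", "context: fast"),
]
print(run(prompt, state="s0"))
```

The point of the split is that the skill sequence is decided once at the task level, while the context is attached afterwards, so the same sequence could in principle be re-instantiated for a different target domain by changing only the context.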

Paper Structure (57 sections, 12 equations, 13 figures, 22 tables, 2 algorithms)

Figures (13)

  • Figure 1: Cross-domain zero-shot adaptation for a multi-modal task prompt: our framework is given a task prompt filled with partial demonstrations and instructed contextual cues in multi-modal snippets. The framework conducts a two-phase adaptation, initially translating the snippets to semantic skills at the task level, and subsequently optimizing them into actions for the target domain at the skill level.
  • Figure 2: Two-phase policy adaptation in SemTra. (1) Task adaptation: the multi-modal skill encoders $\Phi_E$ produce a skill-level language instruction from the task prompt. In the figure, we specifically describe the training of a video skill encoder, contrastively learned through a pretrained VLM $(\Psi_V, \Psi_L)$. The skill-level instruction is then translated into a semantic skill sequence through the semantic skill sequence generator $\Phi_G$ based on a PLM. The skill boundary detector $\Phi_B$ infers the boundary of semantic skills upon a current state in the target domain. (2) Skill adaptation: the context encoder $\Phi^{(g)}_C$ identifies instructed domain contexts using both the skill-level instruction and the semantic skill sequence, generating an executable skill sequence. The online context encoder $\Phi^{(o)}_C$ captures environment hidden contexts at runtime. The behavior decoder $\pi$ generates actions optimized for the target domain, based on the executable skills and environment hidden contexts, along with the current state.
  • Figure 3: Semantic correspondence in V-CLIP space.
  • Figure 4: PLMs for skill sequence generator: (a) PLM fine-tuning case; the x-axis denotes the gradient update steps, and the y-axis denotes the accuracy of the skill sequence generation. (b) PLM zero-shot case; only an engineered prompt is adopted without any model fine-tuning.
  • Figure 5: The left graph represents the difference in expert action distribution based on vehicle configurations. The right figure shows a semantic skill sequence to reach the goal.
  • ...and 8 more figures
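
The captions for Figures 2 and 3 mention a video skill encoder aligned with a pretrained VLM and a semantic correspondence in the embedding space. One common way to use such a shared space is to match a video embedding to the closest skill description by cosine similarity. The sketch below illustrates only that generic retrieval step; it is an assumption about how such a space can be queried, not the procedure from the paper, and the function names, skill labels, and three-dimensional vectors are made up for the example.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_skill(video_emb: Sequence[float],
                  skill_embs: Dict[str, Sequence[float]]) -> str:
    """Return the skill label whose text embedding is closest to the video embedding."""
    return max(skill_embs, key=lambda name: cosine(video_emb, skill_embs[name]))


# Hand-made embeddings standing in for encoder outputs (hypothetical values).
skill_embs = {
    "open microwave": [1.0, 0.0, 0.0],
    "turn on light": [0.0, 1.0, 0.0],
}
video_emb = [0.9, 0.1, 0.0]

print(nearest_skill(video_emb, skill_embs))
```

With real encoders the vectors would be high-dimensional and learned; the retrieval logic itself would stay the same.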