SemTra: A Semantic Skill Translator for Cross-Domain Zero-Shot Policy Adaptation

Sangwoo Shin; Minjong Yoo; Jeongwoo Lee; Honguk Woo

TL;DR

SemTra tackles cross-domain zero-shot policy adaptation for long-horizon tasks by introducing a two-phase framework that first translates interleaved multi-modal prompts into semantic skill sequences using multi-modal skill encoders and a PLM-based semantic skill decoder, then instantiates these skills as executable actions through domain-context and online-context encoders coupled with a behavior decoder. The approach demonstrates robust cross-domain transfer across robotic manipulation and autonomous driving environments (e.g., Franka Kitchen, Meta-World, RLBench, CARLA), outperforming strong baselines in both task- and skill-level evaluations and showing meaningful semantic alignment in the multi-modal space. These results highlight SemTra’s potential to enable zero-shot deployment of complex policies in safety-critical settings by leveraging semantic reasoning and context-aware skill instantiation. The work offers practical implications for cognitive robots and autonomous systems that must interpret abstract instructions and adapt to varied configurations without extensive domain-specific fine-tuning.

Abstract

This work explores the zero-shot adaptation capability of semantic skills (semantically interpretable patterns of expert behavior) in cross-domain settings, where a user input given as interleaved multi-modal snippets can prompt a new long-horizon task for different domains. For these cross-domain settings, we present SemTra, a semantic skill translator framework that uses a set of multi-modal models to extract skills from the snippets and leverages the reasoning capabilities of a pretrained language model to adapt the extracted skills to the target domain. The framework employs a two-level hierarchy for adaptation: task adaptation and skill adaptation. During task adaptation, sequence-to-sequence translation by the language model transforms the extracted skills into a semantic skill sequence tailored to the cross-domain contexts. Skill adaptation optimizes each semantic skill for the target domain context through parametric instantiations, which are facilitated by language prompting and contrastive learning-based context inference. This hierarchical adaptation enables the framework not only to infer a complex task specification in one shot from the interleaved multi-modal snippets, but also to adapt it to new domains in a zero-shot manner. We evaluate our framework in the Meta-World, Franka Kitchen, RLBench, and CARLA environments. The results demonstrate the framework's superiority in performing long-horizon tasks and adapting to different domains, showing its broad applicability in practical use cases such as cognitive robots interpreting abstract instructions and autonomous vehicles operating under varied configurations.
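The two-level hierarchy described in the abstract can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than the authors' code: `Snippet`, `task_adaptation`, `skill_adaptation`, the toy encoders, and the `reverse_order` rule that plays the role of the language model's sequence-to-sequence translation are all assumptions made for illustration. The sketch only shows how the data flows: multi-modal snippets become a semantic skill sequence, which is then instantiated for a target domain.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# NOTE: every name here is a hypothetical stand-in, not the authors' API.


@dataclass
class Snippet:
    modality: str  # e.g. "video" or "text"
    content: str   # toy placeholder for the raw snippet data


SkillEncoder = Callable[[str], List[str]]


def task_adaptation(snippets: List[Snippet],
                    encoders: Dict[str, SkillEncoder],
                    translate: Callable[[List[str]], List[str]]) -> List[str]:
    """Task level: extract skills per modality, then translate the sequence."""
    extracted: List[str] = []
    for snippet in snippets:
        extracted.extend(encoders[snippet.modality](snippet.content))
    return translate(extracted)


def skill_adaptation(skills: List[str], domain_context: str) -> List[str]:
    """Skill level: instantiate each semantic skill for the target domain.

    A real behavior decoder would emit low-level actions; this toy version
    only tags each skill with the domain context it was instantiated for.
    """
    return [f"{skill} [{domain_context}]" for skill in skills]


# Toy encoders: the "video" encoder splits a comma-separated label string,
# the "text" encoder treats the whole instruction as a single skill.
encoders: Dict[str, SkillEncoder] = {
    "video": lambda c: [s.strip() for s in c.split(",")],
    "text": lambda c: [c.strip()],
}


def reverse_order(skills: List[str]) -> List[str]:
    """Toy stand-in for the language model's seq-to-seq translation."""
    return list(reversed(skills))


prompt = [
    Snippet("video", "open microwave, move kettle"),
    Snippet("text", "turn on light"),
]
semantic_skills = task_adaptation(prompt, encoders, reverse_order)
plan = skill_adaptation(semantic_skills, "slow gripper")
```

Keeping the two levels as separate functions mirrors the separation in the abstract: the skill sequence is decided without reference to the target domain's parameters, and the domain context enters only when each skill is instantiated.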

Paper Structure

This paper contains 57 sections, 12 equations, 13 figures, 22 tables, 2 algorithms.

Introduction
Problem Formulation
Approach
Overall Framework
Task Adaptation
Multi-modal skill encoders.
Semantic skill decoder.
Skill Adaptation
Domain context encoder.
Behavior decoder.
Evaluation
Experiment Setting
Environments.
Baselines.
Evaluation metrics.
...and 42 more sections

Figures (13)

Figure 1: Cross-domain zero-shot adaptation for a multi-modal task prompt: our framework is given a task prompt filled with partial demonstrations and instructed contextual cues in multi-modal snippets. The framework conducts a two-phase adaptation, initially translating the snippets to semantic skills at the task level, and subsequently optimizing them into actions for the target domain at the skill level.
Figure 2: Two-phase policy adaptation in SemTra. (1) Task adaptation: the multi-modal skill encoders $\Phi_E$ produce a skill-level language instruction from the task prompt. In the figure, we specifically describe the training of a video skill encoder, contrastively learned through a pretrained VLM $(\Psi_V, \Psi_L)$. The skill-level instruction is then translated into a semantic skill sequence through the semantic skill sequence generator $\Phi_G$ based on a PLM. The skill boundary detector $\Phi_B$ infers the boundary of semantic skills upon a current state in the target domain. (2) Skill adaptation: the context encoder $\Phi^{(g)}_C$ identifies instructed domain contexts using both the skill-level instruction and the semantic skill sequence, generating an executable skill sequence. The online context encoder $\Phi^{(o)}_C$ captures environment hidden contexts at runtime. The behavior decoder $\pi$ generates actions optimized for the target domain, based on the executable skills and environment hidden contexts, along with the current state.
Figure 3: Semantic correspondence in V-CLIP space.
Figure 4: PLMs for skill sequence generator: (a) PLM fine-tuning case; the x-axis denotes the gradient update steps, and the y-axis denotes the accuracy of the skill sequence generation. (b) PLM zero-shot case; only an engineered prompt is adopted without any model fine-tuning.
Figure 5: The left graph represents the difference in expert action distribution based on vehicle configurations. The right figure shows a semantic skill sequence to reach the goal.
...and 8 more figures
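
The Figure 2 caption notes that the video skill encoder is contrastively learned through a pretrained VLM. The exact objective is not given in this summary, so the following Python sketch assumes a standard InfoNCE-style loss between video-side and text-side embeddings. The function names `cosine` and `info_nce`, the temperature value, and the toy two-dimensional embeddings are all illustrative assumptions, not details from the paper.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(video_emb: List[Sequence[float]],
             text_emb: List[Sequence[float]],
             temperature: float = 0.1) -> float:
    """Assumed InfoNCE loss: video_emb[i] should match text_emb[i].

    For each video embedding, the matching text embedding is the positive
    and all other text embeddings in the batch act as negatives.
    """
    total = 0.0
    for i, v in enumerate(video_emb):
        logits = [cosine(v, t) / temperature for t in text_emb]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(video_emb)


# Toy check: aligned pairs should score a much lower loss than swapped pairs.
video = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(video, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(video, [[0.0, 1.0], [1.0, 0.0]])
```

Under this assumed objective, minimizing the loss pulls each video embedding toward the language embedding of its own skill description and away from the others in the batch, which is the kind of alignment that the semantic correspondence in Figure 3 would reflect.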
