Joint Action Language Modelling for Transparent Policy Execution

Theodor Wulff, Rahul Singh Maharjan, Xinyun Chi, Angelo Cangelosi

TL;DR

This work tackles the opacity of autonomous robotic policies by jointly generating next-action tokens and a transparent natural-language statement within an autoregressive framework. By recasting policy learning as a language-generation task and grounding it in a Vision-Language Model, the approach produces interpretable descriptions alongside action tokens. The method uses RoboVQA for pretraining and Language-Table for evaluation, with discretized action and state tokens embedded in the prompts. Results indicate that generating language and actions jointly often improves both trajectory quality and language quality, with the size of the gain depending on output order, state context, and pretraining. They also point to the need for more robust metrics to assess true transparency and semantic alignment.

Abstract

An agent's intention often remains hidden behind the black-box nature of embodied policies. Communication using natural language statements that describe the next action can provide transparency into the agent's behavior. We aim to insert transparent behavior directly into the learning process by transforming the problem of policy learning into a language generation problem and combining it with traditional autoregressive modelling. The resulting model produces transparent natural language statements followed by tokens representing the specific actions to solve long-horizon tasks in the Language-Table environment. Following previous work, the model learns to produce a policy represented by special discretized tokens in an autoregressive manner. We place special emphasis on investigating the relationship between predicting actions and producing high-quality language for a transparent agent. We find that in many cases the quality of both the action trajectory and the transparent statement increases when they are generated simultaneously.
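
To make the joint output concrete, the following is a minimal sketch of how a training target could be assembled from a transparent statement and a sequence of action tokens. The function name `build_target`, the `<act_NNN>` token format, and the whitespace separator are assumptions made for illustration; the abstract only states that the statement is followed by action tokens, and the exact string layout used by the authors is not given here.

```python
from typing import Sequence


def build_target(
    statement: str,
    action_tokens: Sequence[str],
    actions_first: bool = False,
) -> str:
    """Join a language statement and action tokens into one target string.

    Hypothetical helper: the token format and separator are illustrative.
    By default the statement comes first, as described in the abstract;
    `actions_first=True` gives the alternative output order.
    """
    statement = statement.strip()
    actions = " ".join(action_tokens)
    if actions_first:
        return f"{actions} {statement}"
    return f"{statement} {actions}"


# Example with made-up tokens for a two-dimensional action.
target = build_target("push the red block to the left", ["<act_012>", "<act_200>"])
print(target)
```

Because the whole target is a single string, an autoregressive model trained on it conditions the later part of the output on the earlier part, which is one reason the output order can matter.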

Paper Structure

This paper contains 22 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Method Overview. We use the Vision-Language Model PaliGemma to produce a transparent statement and action tokens, given an input prompt describing the current task and a visual observation of the environment. The model is pretrained on visual question answering in robotic settings. We discretize the state and action vectors into special tokens so that they can be embedded directly in the input prompt and target strings.
  • Figure 2: Comparison between different orders of joint output: action tokens before or after the language statement.
  • Figure 3: Sample outputs of our model on our test set including positive and negative samples. We removed the surrounding prompt-specific tokens for readability.
  • Figure 4: Effects of including the tokenized state vector in the input prompt.
  • Figure 5: Effects of varying action tokenization resolutions on action quality.
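
The captions for Figures 1 and 5 refer to discretizing continuous state and action vectors into special tokens at a chosen resolution. A minimal sketch of one common way to do this, uniform binning, is shown below. The value range, the `<act_NNN>` token format, and the function names are assumptions for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def discretize(
    values: Sequence[float],
    low: float,
    high: float,
    num_bins: int,
    prefix: str = "act",
) -> List[str]:
    """Map each continuous value to a special token via uniform binning.

    Assumed scheme: values are clipped to [low, high] and assigned to one
    of `num_bins` equal-width bins; `num_bins` plays the role of the
    tokenization resolution.
    """
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)
        idx = int((clipped - low) / (high - low) * num_bins)
        idx = min(idx, num_bins - 1)  # the upper bound falls into the last bin
        tokens.append(f"<{prefix}_{idx:03d}>")
    return tokens


def undiscretize(
    tokens: Sequence[str],
    low: float,
    high: float,
    num_bins: int,
    prefix: str = "act",
) -> List[float]:
    """Map tokens back to continuous values using bin centers."""
    width = (high - low) / num_bins
    start = len(prefix) + 2  # skip the leading "<", the prefix, and "_"
    return [low + (int(t[start:-1]) + 0.5) * width for t in tokens]


# Example: a two-dimensional action with an assumed range of [-1, 1].
tokens = discretize([0.3, -1.0], low=-1.0, high=1.0, num_bins=256)
recovered = undiscretize(tokens, low=-1.0, high=1.0, num_bins=256)
print(tokens, recovered)
```

Under this scheme the reconstruction error per dimension is at most half a bin width, so a higher resolution reduces quantization error at the cost of a larger token vocabulary.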