FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks

Sabrina McCallum, Amit Parekh, Alessandro Suglia

TL;DR

The paper investigates how language feedback can unlock data-efficient learning from suboptimal demonstrations in embodied vision-language tasks. It introduces FOSSIL, a Transformer-based imitation-learning framework that conditions action generation on language feedback and optional self-supervised feedback prediction, trained on a mixture of optimal and suboptimal trajectories in the controllable BabyAI-XGen environment. Empirical results show language feedback can match scalar rewards in driving compositional generalisation and robustness, with gains when combined with rewards and the auxiliary feedback-prediction task, demonstrating improved data efficiency and resilience to perturbations. The work highlights the practical potential of language-driven feedback as an intuitive alternative to scalar rewards and provides a scalable framework for future exploration of feedback-based learning in more realistic embodied AI settings.

Abstract

Current approaches to embodied AI tend to learn policies from expert demonstrations. However, without a mechanism to evaluate the quality of demonstrated actions, they are limited to learning from optimal behaviour, or they risk replicating errors and inefficiencies. While reinforcement learning offers one alternative, the associated exploration typically results in sacrificing data efficiency. This work explores how agents trained with imitation learning can learn robust representations from both optimal and suboptimal demonstrations when given access to constructive language feedback as a means to contextualise different modes of behaviour. We directly provide language feedback embeddings as part of the input sequence into a Transformer-based policy, and optionally complement the traditional next action prediction objective with auxiliary self-supervised learning objectives for feedback prediction. We test our approach on a range of embodied Vision-and-Language tasks in our custom BabyAI-XGen environment and show significant improvements in agents' compositional generalisation abilities and robustness, suggesting that our data-efficient method allows models to successfully convert suboptimal behaviour into learning opportunities. Overall, our results suggest that language feedback is a competitive and intuitive alternative to intermediate scalar rewards for language-specified embodied tasks.
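
To make the input format described above more concrete, the following is a minimal sketch of how an episode might be flattened into a single sequence for a Transformer-based policy. It uses the token types named in the Figure 2 caption (instructions, language feedback, returns-to-go/rewards, observations, actions). The `Step` class, the `build_sequence` function, and the per-step ordering of tokens are assumptions made for illustration; the page does not specify them, and this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    """One time step of an episode (hypothetical container, not from the paper)."""
    feedback: Optional[str]  # language feedback f_t, or None if none was given
    reward: float            # return-to-go or reward r_t
    observation: str         # stand-in for the visual observation o_t
    action: str              # action a_t


def build_sequence(instruction: str, steps: List[Step]) -> List[Tuple[str, object]]:
    """Flatten an episode into (token_type, value) pairs.

    Assumed layout: the instruction comes first, then for each step the
    optional feedback, the reward, the observation, and the action.
    """
    sequence: List[Tuple[str, object]] = [("instruction", instruction)]
    for step in steps:
        if step.feedback is not None:
            sequence.append(("feedback", step.feedback))
        sequence.append(("reward", step.reward))
        sequence.append(("observation", step.observation))
        sequence.append(("action", step.action))
    return sequence


episode = [
    Step(feedback=None, reward=1.0, observation="obs_0", action="forward"),
    Step(feedback="You moved away from the key.", reward=0.9,
         observation="obs_1", action="left"),
]
tokens = build_sequence("pick up the red key", episode)
```

In a real model each pair would be embedded by a modality-specific encoder before being passed to the Transformer; the sketch only shows the sequence layout.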

Paper Structure

This paper contains 68 sections, 12 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Our method leverages both optimal and suboptimal trajectories for a given task instance by contextualising modes of behaviour with feedback signals. We leverage different types of feedback and additional self-supervised auxiliary tasks to learn highly generalisable and robust representations of behaviour in a data-efficient manner.
  • Figure 2: Input and output tokens for a model conditioning action generation on initial instructions and language feedback, with the option to predict language feedback at the next time step. $m_i$=instructions, $f_i$=language feedback, $r_i$=returns-to-go/rewards, $o_i$=observations, $a_i$=actions.
  • Figure 3: Optimal trajectories generated by a planner, and suboptimal trajectories obtained by replacing planner actions with random actions, which the planner must correct if necessary. The value of p(r) shown is an example.
  • Figure 4: Comparison of success rates under various robustness evaluation settings. We report robustness results on in-distribution data and average over Pickup and PutNext tasks. From left to right: a) representations of subgoals, b) external perturbations, c) adversarial feedback, d) missing feedback. b)-d) correspond to alternative inference scenarios. ST=suboptimal trajectories, FP=feedback prediction.
  • Figure 5: Change in success rate on Systematicity tasks as the amount of training data increases, averaged over Pickup and PutNext tasks. ST=suboptimal trajectories.
  • ...and 5 more figures
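
The procedure in the Figure 3 caption (replacing planner actions with random ones and letting the planner correct afterwards) can be sketched in a toy setting. The one-dimensional corridor, the function names, and the reading of p(r) as the per-step probability of substituting a random action are all assumptions made for illustration; the actual BabyAI-XGen planner and environment are not shown on this page.

```python
import random
from typing import List, Tuple

ACTIONS = (-1, 1)  # toy action space: step left or step right


def planner_action(position: int, goal: int) -> int:
    """Toy planner: always step towards the goal."""
    return 1 if goal > position else -1


def rollout(start: int, goal: int, p_random: float,
            rng: random.Random, max_steps: int = 100) -> List[Tuple[int, int, bool]]:
    """Generate a trajectory of (position, action, is_optimal) tuples.

    With probability p_random the planner's action is replaced by a random
    one. The planner re-plans from the resulting position at the next step,
    which is how detours get corrected in this sketch.
    """
    trajectory = []
    position = start
    for _ in range(max_steps):
        if position == goal:
            break
        optimal = planner_action(position, goal)
        action = rng.choice(ACTIONS) if rng.random() < p_random else optimal
        trajectory.append((position, action, action == optimal))
        position += action
    return trajectory


optimal_traj = rollout(start=0, goal=5, p_random=0.0, rng=random.Random(0))
suboptimal_traj = rollout(start=0, goal=5, p_random=0.3, rng=random.Random(0))
```

Setting `p_random=0.0` yields the planner's optimal trajectory, while larger values yield longer trajectories with detours. The `is_optimal` flag marks the steps where feedback could distinguish good from bad behaviour.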