LIAM: Multimodal Transformer for Language Instructions, Images, Actions and Semantic Maps

Yihao Wang, Raphael Memmesheimer, Sven Behnke

TL;DR

LIAM predicts action transcripts for domestic robots from language instructions, images, actions, and semantic maps. It introduces two pre-training schemes, contrastive alignment and triple contrastive alignment, that align vision, language, and action embeddings before end-to-end training, and it adds a semantic-map modality as input to a multimodal Transformer. Evaluations on ALFRED show that pre-alignment and semantic maps improve cross-modal matching and end-to-end action accuracy, with the triple-contrastive approach delivering the strongest gains, especially on unseen tasks. The results suggest that cross-modal embedding alignment and semantic-map integration can improve open-vocabulary instruction understanding and execution in simulated environments, which is relevant to flexible domestic assistance systems.

Abstract

The availability of large language models and open-vocabulary object perception methods enables more flexibility for domestic service robots. The large variability of domestic tasks can be addressed without implementing each task individually by providing the robot with a task description along with appropriate environment information. In this work, we propose LIAM, an end-to-end model that predicts action transcripts based on language, image, action, and map inputs. Language and image inputs are encoded with a CLIP backbone, for which we designed two pre-training tasks to fine-tune its weights and pre-align the latent spaces. We evaluate our method on the ALFRED dataset, a simulator-generated benchmark for domestic tasks. Our results demonstrate the importance of pre-aligning embedding spaces from different modalities and the efficacy of incorporating semantic maps.
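
To make the idea of pre-aligning latent spaces more concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss between two sets of embeddings, plus one possible way to extend it to three modalities. It is an illustration under stated assumptions, not the loss defined in the paper: the function names, the temperature value of 0.07, and the choice to sum three pairwise losses for the triple case are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_rows(logits, targets):
    """Mean cross-entropy; targets[i] is the index of the correct column in row i."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[t]
    return total / len(logits)


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss: a[i] and b[i] form a positive pair,
    every other pairing in the batch is treated as a negative.
    The temperature is an assumed value, not one taken from the paper."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]  # b-to-a direction
    targets = list(range(len(a)))
    return 0.5 * (cross_entropy_rows(logits, targets)
                  + cross_entropy_rows(logits_t, targets))


def triple_contrastive_loss(vision, language, action, temperature=0.07):
    """Assumed form of a three-modality objective: the sum of the three
    pairwise losses. The paper's actual formulation may differ."""
    return (contrastive_loss(vision, language, temperature)
            + contrastive_loss(vision, action, temperature)
            + contrastive_loss(language, action, temperature))
```

With matched pairs placed on the diagonal of the similarity matrix, the loss is small when each embedding is closest to its own partner and large when partners are swapped, which is the property a pre-alignment stage relies on.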

Paper Structure

This paper contains 12 sections, 10 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Model architecture of LIAM. The blue blocks are all layers that were frozen during the end-to-end model training; the orange blocks are the parts that were trained.
  • Figure 2: An example of the ground truth from one mini-batch for alignment of visual and action embedding space. One action corresponds to two consecutive frames.
  • Figure 3: Ground truth for alignment of visual and language embedding space. One frame sequence corresponds to one language instruction.
  • Figure 4: Example task: "Look at the basketball in the light from the lamp". The given step-by-step instructions are as follows: "Walk to the foot of the bed." -> "Pick up the basketball from the floor." -> "Go to the desk to your left." -> "Turn on the lamp."