Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds

Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska

TL;DR

The paper approaches embodied cognition in robots through Visual Perspective Taking (VPT), a spatial-understanding capability needed for Human-Robot Interaction (HRI), using Vision-Language Models (VLMs). It introduces a synthetic framework with full ground truth, in which a VLM maps an RGB image and a language prompt to the ground-truth transformation $^{CAM}T_{OBJ}$ and reasons about how viewpoints transform, starting with inference of the Z-axis distance. The key contribution is a minimal synthetic dataset, generated in NVIDIA Omniverse Replicator, that provides supervised spatial relations and is publicly released to support future expansion toward full 6-DOF reasoning. The work offers a scalable, controllable testbed for training embodied AI capable of spatial understanding in HRI, with the aim of enabling perspective-aware interaction.

Abstract

We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability for embodied cognition essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4×4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full 6 Degrees of Freedom (DOF) reasoning. The dataset is publicly available to support further research. This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
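
To make the structure of a dataset instance concrete, the following is a minimal sketch of how a 4×4 homogeneous transformation can be assembled and how a Z-axis distance could be read from it. It is illustrative only and not the authors' code. It assumes the common column-vector convention (rotation in the upper-left 3×3 block, translation in the last column) and assumes that "Z-axis distance" refers to the Z component of that translation. The helper names, the file name, the prompt text, and the numeric values are all hypothetical.

```python
import math


def rot_z(theta):
    """3x3 rotation about the Z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def make_transform(rotation, translation):
    """Assemble a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector.

    Assumed layout: [R | t] on top, [0 0 0 1] as the bottom row.
    """
    return [
        [*rotation[0], translation[0]],
        [*rotation[1], translation[1]],
        [*rotation[2], translation[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


def z_distance(cam_T_obj):
    """Z component of the object's position in the camera frame (row 2, column 3)."""
    return cam_T_obj[2][3]


# Hypothetical instance: image path, language prompt, and ground-truth pose.
cam_T_obj = make_transform(rot_z(math.pi / 2), (0.1, -0.2, 1.5))
sample = {
    "image": "sample_0000.png",  # hypothetical file name
    "prompt": "How far is the object from the camera along the Z axis?",
    "cam_T_obj": cam_T_obj,
}

target_z = z_distance(sample["cam_T_obj"])
```

Under these assumptions, the supervised target for the Z-axis task is a single scalar taken from the matrix, while the full matrix remains available for later extension to 6-DOF reasoning.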

Paper Structure

This paper contains 2 sections.

Table of Contents

  1. Introduction
  2. Method