"Where am I?" Scene Retrieval with Language

Jiaqi Chen, Daniel Barath, Iro Armeni, Marc Pollefeys, Hermann Blum

TL;DR

This paper presents Text2SceneGraphMatcher, a pipeline that learns joint embeddings between text descriptions and scene graphs to determine whether they match. The authors define this task as language-based scene-retrieval, which is closely related to coarse-localization.

Abstract

Natural language interfaces to embodied AI are becoming more ubiquitous in our daily lives. This opens up further opportunities for language-based interaction with embodied agents, such as a user verbally instructing an agent to execute some task in a specific location. For example, "put the bowls back in the cupboard next to the fridge" or "meet me at the intersection under the red sign." As such, we need methods that interface between natural language and map representations of the environment. To this end, we explore the question of whether we can use an open-set natural language query to identify a scene represented by a 3D scene graph. We define this task as "language-based scene-retrieval" and it is closely related to "coarse-localization," but we are instead searching for a match from a collection of disjoint scenes and not necessarily a large-scale continuous map. We present Text2SceneGraphMatcher, a "scene-retrieval" pipeline that learns joint embeddings between text descriptions and scene graphs to determine if they are a match. The code, trained models, and datasets will be made public.
Paper Structure (15 sections, 4 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Pipeline visualization. Given an open-set natural language query (left, red) and a reference map of environments represented by a set of 3D scene graphs (right, yellow), we establish text-to-scene-graph correspondences. The text and scene graph correspondences are matched according to their embeddings in a joint embedding space (blue). These embeddings are jointly learned by a joint embedding model (green). Additionally, the text-query content in brackets represents potential downstream applications for our system; it is not part of the scene description.
  • Figure 2: Example of Scene and Text Graphs. In the top left is an image of a living room scene. The bottom left is the corresponding semantic scene graph, where nodes represent the objects, and edges represent a spatial relationship such as "contains," "in front of," or "on." The top right section shows a text description of the scene, from the human-annotated dataset. The bottom right figure shows the corresponding "text-graph" of the scene description. The left-hand side represents our database of scene graphs, while the right-hand side represents an incoming text-query, which we first transform into a "text-graph," and then match with the scene graph.
  • Figure 3: Pipeline Overview. The input to our method is a text-query and a 3D scene graph potentially matching the text. Next, we independently process these inputs to obtain graphs with word2vec embeddings as nodes for the objects. Our network then performs self- and cross-attention with a final average pooling layer to obtain embeddings. These embeddings are concatenated, and a matching probability is predicted by a Multi-Layer Perceptron (MLP). The joint embedding model is trained by jointly optimizing a matching loss and cosine similarity loss.
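The forward pass described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. It assumes that node features are already available as fixed-length vectors (random stand-ins for word2vec embeddings here), that attention uses no learned projections, that graph edges are ignored, and that the two loss terms are a binary cross-entropy and a cosine term with equal weight. All function names and weight shapes are hypothetical.

```python
import math
import random
from typing import List, Tuple

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries: List[List[float]], context: List[List[float]]) -> List[List[float]]:
    """Scaled dot-product attention; context nodes act as both keys and values.
    Assumption: no learned projection matrices, to keep the sketch short."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in context])
        out.append([sum(w * v[d] for w, v in zip(weights, context))
                    for d in range(len(q))])
    return out

def mean_pool(nodes: List[List[float]]) -> List[float]:
    return [sum(n[d] for n in nodes) / len(nodes) for d in range(len(nodes[0]))]

def cosine_similarity(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def embed_pair(text_nodes, scene_nodes) -> Tuple[List[float], List[float]]:
    # Self-attention within each graph, then cross-attention between graphs.
    t_self = attend(text_nodes, text_nodes)
    s_self = attend(scene_nodes, scene_nodes)
    t_cross = attend(t_self, s_self)
    s_cross = attend(s_self, t_self)
    # Average pooling turns per-node features into one embedding per graph.
    return mean_pool(t_cross), mean_pool(s_cross)

def match_probability(text_emb, scene_emb, w_hidden, w_out) -> float:
    """One-hidden-layer MLP on the concatenated embeddings (hypothetical sizes)."""
    x = text_emb + scene_emb  # list concatenation
    hidden = [max(0.0, dot(row, x)) for row in w_hidden]  # ReLU
    return 1.0 / (1.0 + math.exp(-dot(w_out, hidden)))    # sigmoid

def joint_loss(prob: float, cos_sim: float, is_match: bool) -> float:
    """Assumed form: binary cross-entropy plus a cosine term, equally weighted."""
    label = 1.0 if is_match else 0.0
    eps = 1e-9
    bce = -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))
    cos_term = (1.0 - cos_sim) if is_match else max(0.0, cos_sim)
    return bce + cos_term

random.seed(0)
dim = 4  # toy dimension; real word2vec vectors are much larger
text_nodes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
scene_nodes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
w_hidden = [[random.gauss(0, 0.5) for _ in range(2 * dim)] for _ in range(8)]
w_out = [random.gauss(0, 0.5) for _ in range(8)]

text_emb, scene_emb = embed_pair(text_nodes, scene_nodes)
cos_sim = cosine_similarity(text_emb, scene_emb)
prob = match_probability(text_emb, scene_emb, w_hidden, w_out)
loss = joint_loss(prob, cos_sim, is_match=True)
```

In a trained system the weights would be learned by minimizing this kind of combined loss over matching and non-matching text and scene pairs. At query time, one plausible use is to score a text-graph against every scene graph in the database and rank scenes by the predicted matching probability.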