The Wallpaper is Ugly: Indoor Localization using Vision and Language

Seth Pate; Lawson L. S. Wong

TL;DR

This work learns a similarity score between text descriptions and images of locations in an environment. The locations whose images best match a language query give an estimate of the user's location.

Abstract

We study the task of locating a user in a mapped indoor environment using natural language queries and images from the environment. Building on recent pretrained vision-language models, we learn a similarity score between text descriptions and images of locations in the environment. This score allows us to identify locations that best match the language query, estimating the user's location. Our approach is capable of localizing on environments, text, and images that were not seen during training. One model, finetuned CLIP, outperformed humans in our evaluation.
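The scoring step described above can be illustrated with a minimal sketch. It assumes the text and image embeddings have already been produced by some encoder (for example, a CLIP-style model) and are plain NumPy arrays. The function name `localize`, the L2 normalization, and the `temperature` parameter are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def localize(text_vec, image_vecs, temperature=1.0):
    """Score every candidate location against a description.

    text_vec:   shape (d,), embedding of the user's description.
    image_vecs: shape (n, d), one embedding per candidate location.
    Returns (probs, best): a distribution over the n locations and
    the index of the most likely one.
    """
    text_vec = np.asarray(text_vec, dtype=float)
    image_vecs = np.asarray(image_vecs, dtype=float)

    # Assumption: L2-normalize so the dot product is a cosine similarity.
    text_vec = text_vec / np.linalg.norm(text_vec)
    image_vecs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)

    # One similarity score per location (dot product of the two encodings).
    scores = image_vecs @ text_vec / temperature

    # Numerically stable softmax turns scores into a distribution.
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()

    return probs, int(np.argmax(probs))


# Toy usage with hand-made 2-D "embeddings" for three locations.
probs, best = localize([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]])
print(best, probs)
```

In this toy example the second location is selected because its vector points in the same direction as the query vector. In practice the embeddings would come from the pretrained or finetuned encoders.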

Paper Structure (18 sections, 1 equation, 7 figures, 3 tables)

Introduction
Related Work
Matterport3D Environment
Task and Architecture
Data
Human ('gold') set
Room Across Room ('RxR')
Room Across Room 'landmarks'
Evaluation and Results
Metrics
Comparing Models and Finetuning
Human Baseline
Limitations, and Future Directions
Human Evaluation
Continuous Environment
...and 3 more sections

Figures (7)

Figure 1: Use of localization in robotics. Localization is a necessary first step when a robot must help a human without perfect knowledge of their location. This may apply to search and rescue (top) or household assistance (bottom). In this paper, we study only the localization task. Photo credits: Ian Howard (top), Matterport3D (bottom).
Figure 2: Vision-language localization. (a) The model encodes the user's description of their location, the goal. (b) The model encodes an exhaustive sample of images representing all locations in the environment. (c) The model produces a similarity score between each image and the description, which, after softmax, outputs a distribution to predict the user's location.
Figure 3: Example Model Output. Our model creates a likelihood distribution across the 170 locations in this scan. The model’s confidence is shown by both the size and color of the circles, which represent views. We highlight some guesses alongside the true target location, a bathroom. From top left, clockwise: (a) The 10th guess is a laundry room with a sink, but no toilet. (b) The 4th guess has a photo and a toilet. It is a good guess, but the wrong bathroom. (c) The 3rd guess was taken from the hallway, but has a clear view of the target location. (d) The model’s best guess is in the same annotated region (room), adjacent to the target. (e) The 2nd guess is in another bathroom without a framed photo, only a mirror which may resemble one.
Figure 4: Model. CLIP (Radford et al., 2021) uses transformer networks (Vaswani et al., 2017; Dosovitskiy et al., 2020) to encode text and images into vectors of identical length, then compares these vectors by taking their dot product. In our task, a description might be compared with as many as 170 images (views) from the environment (scan).
Figure 5: (a) The Matterport3D (m3d) 360° RGBD set (red) contains most, but not all, of the images in Matterport3D (purple). We used the former for its equirectangular format. (b) The human 'gold' test set (orange) is disjoint from the finetuning sets RxR and RxR_landmarks (beige). Both are subsets of the m3d 360° RGBD set.
...and 2 more figures
