Incorporating simulated spatial context information improves the effectiveness of contrastive learning models

Lizhen Zhu, James Z. Wang, Wonseuk Lee, Brad Wyble

TL;DR

This work presents a unique approach, termed environmental spatial similarity (ESS), that complements existing contrastive learning methods and has the potential to enable rapid visual learning in agents operating in new environments with unique visual characteristics.

Abstract

Visual learning often occurs in a specific context, where an agent acquires skills through exploration and tracking of its location in a consistent environment. The historical spatial context of the agent provides a similarity signal for self-supervised contrastive learning. We present a unique approach, termed Environmental Spatial Similarity (ESS), that complements existing contrastive learning methods. Using images from simulated, photorealistic environments as an experimental setting, we demonstrate that ESS outperforms traditional instance discrimination approaches. Moreover, sampling additional data from the same environment substantially improves accuracy and provides new augmentations. ESS allows remarkable proficiency in room classification and spatial prediction tasks, especially in unfamiliar environments. This learning paradigm has the potential to enable rapid visual learning in agents operating in new environments with unique visual characteristics. Potentially transformative applications span from robotics to space exploration. Our proof of concept demonstrates improved efficiency over methods that rely on extensive, disconnected datasets.

Paper Structure (33 sections, 7 equations, 8 figures, 8 tables)

This paper contains 33 sections, 7 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: The impact of position on the appearance, lighting, and camera distance/focal length of an image. (A) The perspective of a room can greatly impact its appearance when rendered from different positions in the ThreeDWorld simulated environment. (B) The natural lighting of a scene can significantly alter its appearance when captured at different times of the day. Photos courtesy of Federico Adolfi. (C) The head and facial features of a statue may appear differently when captured with different focal lengths. Photos courtesy of James Z. Wang.
  • Figure 2: The simulated environments and the trajectories used by the embodied agent to generate the datasets. (A) The Archviz House. (B) The Apartment. (C-E) The trajectories for House14K, Apt14K, and House100K, respectively. (F-G) Three example images from the House and Apt environments, respectively. During training, random batches were sampled from these trajectories. Images were considered similar if they were spatially close to each other.
  • Figure 3: Illustration of a more comprehensive approach to evaluating spatial similarity, which considers not only the distance and angle between two views but also the specific region of space being observed. Even though the angular difference between the two views, calculated as $\lvert \theta_1-\theta_2 \rvert$, and the position difference are the same in (A) and (B), the two views in (A) could be considered more similar because their perspectives converge on a shared region of space. In contrast, the views in (B) may be considered less similar because their perspectives diverge and the observed regions do not overlap.
  • Figure 4: Illustration of representative lighting conditions available in ThreeDWorld. A total of 95 lighting conditions are shown here, distributed according to a cluster analysis based on pixel values of three example images captured in the House environment using the t-SNE algorithm. From the total collection of skyboxes, nine were selected to cover this space. For each selected skybox, an example image, taken from an identical viewpoint within the house, is shown. From left to right and top to bottom, the skyboxes' names are as follows: Kiara_1_dawn, Ninomaru_teien, Small_hangar_01, Venice_sunrise, Blue_grotto, Whipple_creek_gazebo, Mosaic_tunnel, Royal_esplanade, and Indoor_pool.
  • Figure 5: The proposed ESS-MB approach. The learning algorithm compares a given image against the $N$ images in the dictionary and finds positive pairs by comparing their relative spatial position and rotation values against a given threshold. The feature values of all images within the dictionary are then compared to compute the loss value, which depends on whether each image is part of a positive pair. This loss value is used to drive gradient descent as in the original MoCo formulation.
  • ...and 3 more figures
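
The positive-pair selection described in the Figure 5 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the 2D position and single yaw angle per image, the threshold values, the temperature, and the multi-positive InfoNCE form of the loss are all assumptions made for illustration. The caption only states that positives are found by thresholding relative position and rotation, and that the loss drives gradient descent as in MoCo.

```python
import numpy as np


def spatial_positive_mask(q_pos, q_rot, k_pos, k_rot,
                          pos_thresh=0.8, rot_thresh=12.0):
    """Return a (B, N) boolean mask: True where a dictionary image counts
    as a positive for a query image.

    q_pos: (B, 2) query positions, q_rot: (B,) query yaw in degrees.
    k_pos: (N, 2) dictionary positions, k_rot: (N,) dictionary yaw in degrees.
    pos_thresh and rot_thresh are hypothetical values, not from the paper.
    """
    # Pairwise Euclidean distance between query and dictionary positions.
    dist = np.linalg.norm(q_pos[:, None, :] - k_pos[None, :, :], axis=-1)
    # Pairwise angular difference, wrapped into [0, 180] degrees.
    dtheta = np.abs((q_rot[:, None] - k_rot[None, :] + 180.0) % 360.0 - 180.0)
    return (dist <= pos_thresh) & (dtheta <= rot_thresh)


def multi_positive_info_nce(q, keys, pos_mask, temperature=0.07):
    """Assumed loss form: InfoNCE averaged over every positive of each query.

    q: (B, D) query features, keys: (N, D) dictionary features,
    pos_mask: (B, N) boolean mask from spatial_positive_mask.
    """
    # Compare L2-normalized features by cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ keys.T / temperature
    # Numerically stable log-softmax over the dictionary.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average the negative log-probability over the positives of each query,
    # skipping queries that have no positive in the dictionary.
    n_pos = pos_mask.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        return 0.0
    per_query = -(log_prob * pos_mask).sum(axis=1)[has_pos] / n_pos[has_pos]
    return float(per_query.mean())
```

With a single positive per query, this loss reduces to the standard InfoNCE objective used by MoCo. The sketch also uses only distance and yaw difference, so it does not capture the region-of-space overlap that Figure 3 discusses.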