Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction

Junyi Chen, Di Huang, Weicai Ye, Wanli Ouyang, Tong He

TL;DR

Generative Spatial Transformer (GST) is a novel auto-regressive framework that jointly addresses spatial localization and view prediction: it estimates the camera pose from a single image and predicts the view from a new camera pose, bridging the gap between spatial awareness and visual prediction.

Abstract

Spatial intelligence is the ability of a machine to perceive, reason, and act in three dimensions within space and time. Recent advancements in large-scale auto-regressive models have demonstrated remarkable capabilities across various reasoning tasks. However, these models often struggle with fundamental aspects of spatial reasoning, particularly in answering questions like "Where am I?" and "What will I see?". While some attempts have been made, existing approaches typically treat these questions as separate tasks, failing to capture their interconnected nature. In this paper, we present Generative Spatial Transformer (GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction. Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction. The proposed camera tokenization method enables the model to learn the joint distribution of 2D projections and their corresponding spatial perspectives in an auto-regressive manner. This unified training paradigm demonstrates, for the first time, that joint optimization of pose estimation and novel view synthesis improves performance in both tasks, highlighting the inherent relationship between spatial awareness and visual prediction.

Paper Structure

This paper contains 21 sections, 5 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: We introduce Generative Spatial Transformer (GST), which unifies the image representation and the camera pose representation within 3D vision, so that, given an observed image and a condition in one modality, the result in the other modality is generated auto-regressively.
  • Figure 2: Illustration of GST. Given an observed image, a task category, and either the target camera position or the other view image, GST generates the desired outcome in an auto-regressive manner. The training process of GST comprises two phases: (1) training the image and camera tokenizers, (2) training the auto-regressive model.
  • Figure 3: Comparison of target distributions for different tasks. Previous methods construct single-modality target distributions for the novel view synthesis and camera estimation tasks. GST instead introduces a joint distribution over both the image and the corresponding camera poses.
  • Figure 4: The impact of joint distribution training. (a) By transforming two distributions into a joint distribution, the target of the model can be unified. (b, c) We trained using the same initial model parameters and hyperparameters, averaging the loss and gradient norms every $50$ steps, and observed that joint distribution training yields a more stable training trajectory than alternating between the two objectives.
  • Figure 5: Comparison of attention mechanisms. (a) We standardize the token sequences $s$ of the two tasks to the same length. (b) In the conventional alternating training scheme, target modality tokens can attend to all preceding conditional tokens. (c) We propose to model the joint distribution such that, in (a), the token sequences for the two tasks have visibility only to the current observation tokens $o$.
  • ...and 7 more figures
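
The contrast drawn in the Figure 5 caption can be illustrated with boolean attention masks. The following is a minimal sketch of one possible reading of that caption, not the authors' implementation. The sequence layout (observation tokens first, then two equal-role segments), the function names `causal_mask` and `observation_only_mask`, and the choice to keep each segment causal within itself are all assumptions made for illustration.

```python
import numpy as np


def causal_mask(n_total: int) -> np.ndarray:
    """Standard causal mask: position i may attend to every position j <= i."""
    return np.tril(np.ones((n_total, n_total), dtype=bool))


def observation_only_mask(n_obs: int, n_a: int, n_b: int) -> np.ndarray:
    """Hypothetical mask for a sequence laid out as [observation | A | B].

    Assumption (not confirmed by the source): tokens in segments A and B
    attend to all observation tokens and causally to their own segment,
    but never to the other segment.
    """
    n_total = n_obs + n_a + n_b
    mask = np.zeros((n_total, n_total), dtype=bool)

    # Observation tokens stay causal among themselves.
    mask[:n_obs, :n_obs] = causal_mask(n_obs)

    start = n_obs
    for length in (n_a, n_b):
        end = start + length
        mask[start:end, :n_obs] = True                    # see the observation
        mask[start:end, start:end] = causal_mask(length)  # causal within segment
        start = end
    return mask


# Toy sizes: 2 observation tokens, 2 tokens per segment.
full = causal_mask(6)
restricted = observation_only_mask(n_obs=2, n_a=2, n_b=2)
```

In this sketch `True` marks a permitted query-key pair. The only difference between the two masks is the block where the second segment would look back at the first: the plain causal mask permits it, while the restricted mask blocks it so that both segments condition on the observation tokens alone.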