PT43D: A Probabilistic Transformer for Generating 3D Shapes from Single Highly-Ambiguous RGB Images

Yiheng Xiong, Angela Dai

TL;DR

A transformer-based autoregressive model that generates a probabilistic distribution of 3D shapes conditioned on an RGB image containing potentially highly ambiguous observations of the object. It outperforms the state of the art in both synthetic and real-world scenarios.

Abstract

Generating 3D shapes from single RGB images is essential in various applications such as robotics. Current approaches typically target images containing clear and complete visual descriptions of the object, without considering common realistic cases where objects are largely occluded or truncated. We thus propose a transformer-based autoregressive model to generate the probabilistic distribution of 3D shapes conditioned on an RGB image containing potentially highly ambiguous observations of the object. To handle realistic scenarios such as occlusion or field-of-view truncation, we create simulated image-to-shape training pairs that enable improved fine-tuning for real-world scenarios. We then adopt cross-attention to effectively identify the most relevant region of interest from the input image for shape generation. This enables inference of sampled shapes with reasonable diversity and strong alignment with the input image. We train and test our model on our synthetic data, then fine-tune and test it on real-world data. Experiments demonstrate that our model outperforms the state of the art in both scenarios.

Paper Structure (22 sections, 4 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Our approach seeks to model the distribution of 3D shapes in latent space conditioned on a single highly ambiguous image, enabling the sampling of multiple diverse hypotheses during inference. Colored points are for visualization purposes only, indicating the overlap between ground-truth visible parts and plausible shapes.
  • Figure 2: Method Overview. Our model adopts a low-dimensional and discretized latent space for shape generation. An input RGB image is first processed by an image encoder. The extracted image encodings are then fed into the conditional module, where they are attended to via cross-attention by the input sequence, which consists of a start token, potentially filled grids containing latent features, and a target grid containing its position embedding, so that the output sequence contains essential image features. The transformer then processes this sequence and predicts the latent feature for the target grid. Conditional cross-attention and the transformer are applied autoregressively until a complete sequence is output, where each cell contains the conditional distribution of latent features. During inference, we can gather the complete sequence into grids, then sample and decode to multiple plausible hypotheses.
  • Figure 3: Identifying Most Relevant Regions in the Image via Conditional Cross-Attention. We perform cross-attention between image encodings and the input sequence to effectively identify the most relevant region of interest from the input image. The input sequence, consisting of a start token, potentially filled locations with their own predicted latent features, and a target location with its position embedding, is multiplied with a weight matrix $Q$, and the image encodings are multiplied with weight matrices $V$ and $K$ independently. The resulting sequence contains a weighted sum of image features. We then project it to the original dimension and concatenate it with the original sequence embeddings.
  • Figure 4: Different Types of Ambiguity. (a) indicates ambiguity caused by occlusion. (b) and (c) represent ambiguity resulting from truncation due to the limited field of view of the camera.
  • Figure 5: Ground-Truth Mapping. (a) represents a sample rendering, and (b), (c) and (d) represent multiple distinct ground-truth shapes that correspond to (a). Yellow points are for visualization purposes only, indicating the overlap between ground-truth visible parts and multiple plausible shapes.
  • ...and 9 more figures
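
The conditional cross-attention step described in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name, all tensor sizes, the single attention head, the scaled dot-product softmax, and the choice to concatenate along the feature axis are assumptions made here for illustration. The caption only states that the sequence is multiplied with $Q$, the image encodings with $K$ and $V$, and that the result is projected back and concatenated with the original sequence embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conditional_cross_attention(seq, img, W_q, W_k, W_v, W_o):
    """Hypothetical single-head cross-attention sketch.

    seq: (T, d_seq) sequence embeddings (start token, filled cells, target cell)
    img: (N, d_img) image encodings, one row per image region
    """
    q = seq @ W_q                      # queries from the sequence, (T, d_attn)
    k = img @ W_k                      # keys from the image, (N, d_attn)
    v = img @ W_v                      # values from the image, (N, d_attn)
    # Assumption: standard scaled dot-product scores, not stated in the caption.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # (T, N), each row sums to 1
    attended = weights @ v              # weighted sum of image features
    projected = attended @ W_o          # back to the sequence dimension d_seq
    # Assumption: concatenation is along the feature axis.
    fused = np.concatenate([seq, projected], axis=-1)  # (T, 2 * d_seq)
    return fused, weights

# Toy sizes, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
T, N, d_seq, d_img, d_attn = 5, 49, 8, 16, 12
seq = rng.normal(size=(T, d_seq))
img = rng.normal(size=(N, d_img))
W_q = rng.normal(size=(d_seq, d_attn))
W_k = rng.normal(size=(d_img, d_attn))
W_v = rng.normal(size=(d_img, d_attn))
W_o = rng.normal(size=(d_attn, d_seq))

fused, weights = conditional_cross_attention(seq, img, W_q, W_k, W_v, W_o)
```

Each row of `weights` is a distribution over image regions for one sequence position, which is one way to read "identifying the most relevant region of interest" for the target cell.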