Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops

Aditya Prakash; Arjun Gupta; Saurabh Gupta

TL;DR

This paper proposes Intrinsics-Aware Positional Encoding (KPE), which encodes the location of a crop in the image together with the camera intrinsics, and demonstrates its benefits on three popular 3D-from-a-single-image benchmarks.

Abstract

Objects undergo varying amounts of perspective distortion as they move across a camera's field of view. Models for predicting 3D from a single image often work with crops around the object of interest and ignore the location of the object in the camera's field of view. We note that ignoring this location information further exaggerates the inherent ambiguity in making 3D inferences from 2D images and can prevent models from even fitting to the training data. To mitigate this ambiguity, we propose Intrinsics-Aware Positional Encoding (KPE), which incorporates information about the location of crops in the image and camera intrinsics. Experiments on three popular 3D-from-a-single-image benchmarks: depth prediction on NYU, 3D object detection on KITTI & nuScenes, and predicting 3D shapes of articulated objects on ARCTIC, show the benefits of KPE.
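The ambiguity described in the abstract can be illustrated with a toy calculation. The sketch below is not from the paper: it assumes a 2D pinhole camera, a circle of fixed radius, and made-up numbers (focal length 500 px, radius 0.5, a 40° off-axis direction). The function names `image_width` and `matching_distance` are hypothetical. It searches for the distance at which an off-axis circle projects to the same image width as an on-axis circle of the same size.

```python
import math

def image_width(f, r, d, phi):
    """Projected width of a circle in a 2D pinhole camera.

    f: focal length (pixels), r: circle radius, d: distance of the circle
    centre from the camera origin, phi: angle (radians) between the optical
    axis and the ray to the circle centre.
    """
    alpha = math.asin(r / d)          # half-angle subtended at the camera origin
    assert phi + alpha < math.pi / 2  # the whole circle must be in front of the camera
    # The two tangent rays hit the image line at f*tan(phi +/- alpha).
    return f * (math.tan(phi + alpha) - math.tan(phi - alpha))

def matching_distance(f, r, d_ref, phi, iters=80):
    """Bisection search for the distance at which a circle seen at angle phi
    has the same projected width as an on-axis circle at distance d_ref."""
    target = image_width(f, r, d_ref, 0.0)
    lo, hi = d_ref, 100.0 * d_ref     # projected width shrinks as distance grows
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if image_width(f, r, mid, phi) > target:
            lo = mid                  # still too wide: move farther away
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Assumed toy numbers, chosen only for illustration.
f, r, d_ref, phi = 500.0, 0.5, 5.0, math.radians(40.0)
d_far = matching_distance(f, r, d_ref, phi)
w_ref = image_width(f, r, d_ref, 0.0)
w_far = image_width(f, r, d_far, phi)
print(f"on-axis:  distance {d_ref:.2f}, width {w_ref:.2f} px")
print(f"off-axis: distance {d_far:.2f}, width {w_far:.2f} px")
```

In this setup the two circles have matching widths in the image even though the off-axis one is substantially farther away, so a predictor that sees only the crop cannot tell them apart by size alone.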

Paper Structure (11 sections, 7 figures, 6 tables)

This paper contains 11 sections, 7 figures, 6 tables.

Introduction
Related Work
Parallelepipeds Case Study
Intrinsics-Aware Positional Encoding (KPE)
Using KPE
Experiments
Application 1: 3D Pose of Articulated Objects in Contact [Fan2023CVPR]
Application 2: Dense Metric Depth Prediction [silberman2012indoor, bhat2023zoedepth]
Application 3: 3D Object Detection on KITTI, nuScenes [brazil2023omni3d, geiger2012we, Caesar2020CVPR]
Discussion
Conclusion

Figures (7)

Figure 1: Perspective Distortion-induced Shape Ambiguity. Consider two circles of the same size undergoing perspective projection under a pinhole camera. Even though they are at different distances from the camera, they appear to be the same size in the image due to perspective distortion. A model (e.g., a neural network) that predicts the distances of these circles from the camera based purely on the appearance of the image crops, without taking into account their location in the camera's field of view, will fail at this task. We call this the Perspective Distortion-induced Shape Ambiguity in Image Crops or PSAC. In this work, we propose an encoding to incorporate the crop location in the camera's field of view as input and show its effectiveness on metric depth prediction, 3D object detection & 3D pose estimation of articulated objects (see the Experiments section).
Figure 2: Parallelepipeds Case Study. Figure (a) plots the root-relative 3D keypoint error vs. 2D keypoint error of different parallelepipeds placed at different locations in the camera's field of view w.r.t. a reference cuboid shown in the green box in (a.2). Points are color coded by the distance between the parallelepiped crops. As we let crops go farther away (the red points), we start finding parallelepipeds that have very different 3D shapes but happen to project such that their 2D keypoints look the same as those of the reference cuboid. Figure (b) shows a similar plot but for absolute 3D keypoint error. Figure (c) shows these ambiguous 3D parallelepipeds in the top row and their renderings in the bottom row. The first figure in the green box is the reference w.r.t. which we measure 2D and 3D keypoint errors.
Figure 3: Predicting root-relative (left) or absolute (right) 3D shape from 2D image crops fails in the absence of information about the location of the crop in the camera's field of view. Training loss saturates at a high value because of the inherent ambiguity. Adding information about the location of the crop in the camera's field of view alleviates this ambiguity, leading to better metrics for both root-relative and absolute 3D prediction.
Figure 4: Intrinsics-Aware Positional Encodings (KPE). (a) For each pixel in the image, we compute its position in the camera's field of view ($\theta_x$ and $\theta_y$), or the angular distance that the pixel makes with respect to the principal point and the camera origin (as shown on the left). Note that both $\theta_x$ and $\theta_y$ are sensitive to the camera intrinsic parameters. (b) For a dense prediction task, we make use of a dense prediction encoding which contains the positional encoding for each pixel in the region of interest. (c) For other tasks, we simply represent the positions of the corners of the relevant region of interest in addition to the center point. The positional encoding can be passed into the network at the input level or concatenated to some intermediate representation; this design choice is made separately for each task.
Figure 5: 3D pose visualizations on ARCTIC. Our proposed modification of intrinsics-aware positional encoding (KPE) improves over the ArcticNet-SF [Fan2023CVPR] model by predicting better 3D poses in interaction scenarios (note the difference in the articulation angle and global pose). For each image, we show the projection of the object mesh with the predicted pose on the image and from 2 different camera views.
...and 2 more figures
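The encoding described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the angles are computed as $\theta_x = \arctan((u - c_x)/f_x)$ and $\theta_y = \arctan((v - c_y)/f_y)$ for pixel $(u, v)$, focal lengths $(f_x, f_y)$ and principal point $(c_x, c_y)$, and it ignores lens distortion. The function names `pixel_angles`, `dense_kpe` and `sparse_kpe`, and the `(x0, y0, x1, y1)` box layout, are hypothetical.

```python
import numpy as np

def pixel_angles(u, v, fx, fy, cx, cy):
    """Angles (radians) of the ray through pixel (u, v) relative to the
    optical axis, measured separately along the image x and y axes.
    Assumed form: arctan of the normalized offset from the principal point."""
    theta_x = np.arctan((np.asarray(u, dtype=float) - cx) / fx)
    theta_y = np.arctan((np.asarray(v, dtype=float) - cy) / fy)
    return theta_x, theta_y

def dense_kpe(box, fx, fy, cx, cy):
    """Dense variant: one (theta_x, theta_y) pair per pixel of the crop.
    box = (x0, y0, x1, y1) in integer pixel coordinates, x1/y1 exclusive.
    Returns an array of shape (2, H, W)."""
    x0, y0, x1, y1 = box
    us, vs = np.meshgrid(np.arange(x0, x1), np.arange(y0, y1))
    theta_x, theta_y = pixel_angles(us, vs, fx, fy, cx, cy)
    return np.stack([theta_x, theta_y], axis=0)

def sparse_kpe(box, fx, fy, cx, cy):
    """Sparse variant: angles of the four crop corners plus the crop centre,
    flattened to a length-10 vector (x and y angle per point)."""
    x0, y0, x1, y1 = box
    us = np.array([x0, x1, x0, x1, 0.5 * (x0 + x1)], dtype=float)
    vs = np.array([y0, y0, y1, y1, 0.5 * (y0 + y1)], dtype=float)
    theta_x, theta_y = pixel_angles(us, vs, fx, fy, cx, cy)
    return np.stack([theta_x, theta_y], axis=1).reshape(-1)
```

Under these assumptions, the dense map could be concatenated to the crop as extra input channels, while the sparse vector could be appended to an intermediate feature. Which of the two is used, and where it enters the network, is a per-task choice according to the caption.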
