3DRot: Rediscovering the Missing Primitive for RGB-Based 3D Augmentation

Shitian Yang, Deyu Li, Xiaoke Jiang, Lei Zhang

TL;DR

3DRot tackles the scarcity of robust RGB-based 3D augmentations by introducing a depth-free, geometry-faithful rotation about the camera's optical center. It derives a closed-form projective mapping, $H_A = K_A R_{AB} K_B^{-1}$, to warp RGB images while consistently updating intrinsics and 3D annotations, enabling depth-free preservation of 2D–3D relationships. The method yields consistent improvements across monocular 3D detection, monocular depth estimation, and LiDAR+RGB 3D detection on SUN RGB-D, NYU Depth v2, KITTI, and cross-domain splits, demonstrating its generality and practicality. As a simple plug-and-play primitive, 3DRot enhances data diversity without scene reconstruction, potentially boosting robustness to viewpoint changes in real-world 3D perception systems. The work highlights a practical path toward richer RGB-based 3D augmentation and cross-modal consistency in multi-sensor setups.
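The depth-free property of the mapping above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a pinhole camera, assumes that `R` maps points from the original camera frame to the rotated camera frame (the paper's $R_{AB}$ may use the opposite direction), and uses made-up intrinsics. The helper names `rotation_yaw`, `rotation_homography`, `project`, and `warp_pixel` are hypothetical.

```python
import numpy as np

def rotation_yaw(deg):
    """Rotation about the camera's y-axis (yaw), angle in degrees."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def rotation_homography(K_dst, R, K_src):
    """Homography K_dst * R * K_src^{-1} mapping source pixels to rotated-view pixels."""
    return K_dst @ R @ np.linalg.inv(K_src)

def project(K, X):
    """Pinhole projection of a 3D point X (camera frame) to pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]

def warp_pixel(H, uv):
    """Apply a homography to one pixel, using homogeneous coordinates."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]

# Illustrative intrinsics (assumed values, not taken from any dataset).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = rotation_yaw(10.0)
H = rotation_homography(K, R, K)

# Two points on the same viewing ray but at different depths.
X_near = np.array([0.2, -0.1, 2.0])
X_far = 5.0 * X_near

uv = project(K, X_near)        # both points share this source pixel
uv_rot_H = warp_pixel(H, uv)   # pixel moved by the homography, no depth used

print("homography:", uv_rot_H)
print("near point reprojected:", project(K, R @ X_near))
print("far point reprojected: ", project(K, R @ X_far))
```

Both reprojections agree with the homography result because the scale of a point along its viewing ray cancels in the perspective division. This is why a rotation about the optical center needs no depth. In a full pipeline the same `H` would be passed to an image-warping routine, and `K_dst` could differ from `K_src` if the intrinsics are also updated.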

Abstract

RGB-based 3D tasks, e.g., 3D detection, depth estimation, 3D keypoint estimation, still suffer from scarce, expensive annotations and a thin augmentation toolbox, since many image transforms, including rotations and warps, disrupt geometric consistency. While horizontal flipping and color jitter are standard, rigorous 3D rotation augmentation has surprisingly remained absent from RGB-based pipelines, largely due to the misconception that it requires scene depth or scene reconstruction. In this paper, we introduce 3DRot, a plug-and-play augmentation that rotates and mirrors images about the camera's optical center while synchronously updating RGB images, camera intrinsics, object poses, and 3D annotations to preserve projective geometry, achieving geometry-consistent rotations and reflections without relying on any scene depth. We first validate 3DRot on a classical RGB-based 3D task, monocular 3D detection. On SUN RGB-D, inserting 3DRot into a frozen DINO-X + Cube R-CNN pipeline raises $IoU_{3D}$ from 43.21 to 44.51, cuts rotation error (ROT) from 22.91$^\circ$ to 20.93$^\circ$, and boosts $mAP_{0.5}$ from 35.70 to 38.11; smaller but consistent gains appear on a cross-domain IN10 split. Beyond monocular detection, adding 3DRot on top of the standard BTS augmentation schedule further improves NYU Depth v2 from 0.1783 to 0.1685 in abs-rel (and 0.7472 to 0.7548 in $\delta<1.25$), and reduces cross-dataset error on SUN RGB-D. On KITTI, applying the same camera-centric rotations in MVX-Net (LiDAR+RGB) raises moderate 3D AP from about 63.85 to 65.16 while remaining compatible with standard 3D augmentations.
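The synchronous update of object poses and 3D annotations described in the abstract can be illustrated for the rotation case. This is a sketch under stated assumptions rather than the paper's code: a 3D box is assumed to be stored as a center, a 3x3 orientation matrix, and side lengths, all in the camera frame, and `R` is assumed to map original-camera coordinates to rotated-camera coordinates. The helpers `rotation_roll`, `box_corners`, and `rotate_box` are hypothetical, and mirroring is not covered.

```python
import numpy as np

def rotation_roll(deg):
    """Rotation about the camera's z-axis (optical axis), angle in degrees."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

def box_corners(center, R_obj, dims):
    """Eight corners of an oriented box, expressed in the camera frame."""
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                     dtype=float)
    local = signs * (np.asarray(dims, dtype=float) / 2.0)  # corners in box frame
    return local @ R_obj.T + center

def rotate_box(center, R_obj, dims, R):
    """Re-express a box in the rotated camera frame; side lengths are unchanged."""
    return R @ center, R @ R_obj, dims

# Illustrative box (assumed values).
center = np.array([0.5, 0.2, 3.0])
R_obj = rotation_roll(15.0)          # any proper rotation works as an orientation
dims = np.array([1.0, 0.6, 0.4])

R = rotation_roll(20.0)              # camera-centric augmentation rotation
new_center, new_R_obj, new_dims = rotate_box(center, R_obj, dims, R)

old_corners = box_corners(center, R_obj, dims)
new_corners = box_corners(new_center, new_R_obj, new_dims)
print(np.abs(new_corners - old_corners @ R.T).max())  # numerically zero
```

Rotating every corner directly and rebuilding the corners from the updated pose give the same result, so the label update stays consistent with the image warp as long as both use the same `R`.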

Paper Structure

This paper contains 45 sections, 28 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Overall concept of 3DRot. We rotate images about the camera's optical center and synchronously update intrinsics, poses, and labels to preserve projective geometry. In each panel, the left subfigure is a concept sketch: red denotes the 3D bounding box, blue the screen border, and yellow the projected 3D box on the screen. Panels: top-left (origin), top-right (yaw $50^\circ$), bottom-left (roll $20^\circ$), bottom-right (pitch $20^\circ$).
  • Figure 2: Depth maps. The block is a $2\times2$ grid shown in row-major order with camera rotations about the optical center: $(0^\circ,0^\circ,0^\circ)$, yaw $+40^\circ$, pitch $+20^\circ$, and roll $+20^\circ$. In all cases 3DRot applies the same pure-rotation homography to the RGB image and updates labels/intrinsics accordingly, so the depth remains 2D–3D consistent while the image footprint changes. See Supplementary Video for an animated demo.
  • Figure 3: Left: original image; right: view rotated about the optical center with roll $+30^\circ$. Insets: the ray-imaging diagram (top-right) rotates the image plane about the optical center while keeping the viewing rays fixed, and the pose-imaging diagram (bottom-left) rotates the camera frame so that 3D labels remain valid. The resulting RGB on the right preserves 2D–3D geometric consistency while only the screen footprint changes. See Supplementary Video for an animated demo.
  • Figure 4: NOCS (R/G/B encode X/Y/Z) maps under camera-centric rotations about the optical center. The block is a $2\times2$ grid shown in row-major order with camera rotations: $(0^\circ,0^\circ,0^\circ)$, yaw $+40^\circ$, pitch $+20^\circ$, and roll $+20^\circ$. In all cases 3DRot applies the same pure-rotation homography to the RGB image and updates labels/intrinsics accordingly, so the depth and NOCS remain 2D–3D consistent while the image footprint changes.
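One way to read the depth-map captions above is that each labeled pixel is moved by the rotation homography and its depth value is re-expressed along the new optical axis. The sketch below illustrates that reading for a single pixel. It is an assumption-laden illustration, not the paper's procedure: it assumes depth labels store z-depth (distance along the optical axis) rather than range, assumes `R` maps original-camera coordinates to rotated-camera coordinates, and uses made-up intrinsics. `rotation_pitch` and `rotate_depth_sample` are hypothetical helpers.

```python
import numpy as np

def rotation_pitch(deg):
    """Rotation about the camera's x-axis (pitch), angle in degrees."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])

def rotate_depth_sample(K, R, uv, z):
    """Move one labeled pixel (uv, z-depth) into the rotated view.

    Returns the new pixel location, the new z-depth, and the 3D points
    before and after rotation.
    """
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])  # third component is 1
    X = z * ray                   # back-projected 3D point, original camera frame
    X_rot = R @ X                 # same point, rotated camera frame
    p = K @ X_rot
    return p[:2] / p[2], X_rot[2], X, X_rot

# Illustrative intrinsics and label (assumed values).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = rotation_pitch(20.0)
uv, z = (400.0, 300.0), 3.0

uv_new, z_new, X, X_rot = rotate_depth_sample(K, R, uv, z)
print("new pixel:", uv_new)
print("z-depth before/after:", z, z_new)
print("range before/after:", np.linalg.norm(X), np.linalg.norm(X_rot))
```

Under these assumptions the z-depth changes with the rotation while the Euclidean range to the optical center does not, and the new pixel location coincides with the one given by the pure-rotation homography applied to the RGB image.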