Cubify Anything: Scaling Indoor 3D Object Detection
Justin Lazarow, David Griffiths, Gefen Kohavi, Francisco Crespo, Afshin Dehghan
TL;DR
The paper tackles indoor 3D object detection from a single RGB(-D) frame by building CA-1M, a large-scale, pixel-perfect dataset whose exhaustively labeled 3D boxes are rendered to every frame. It introduces Cubify Transformer (CuTR), a fully Transformer-based detector that predicts 3D boxes from RGB(-D) inputs without 3D inductive biases. Trained on CA-1M, CuTR outperforms point-based 3D detectors, recalling over 62% of objects in 3D; pre-trained on CA-1M, it also outperforms them on a more diverse variant of SUN RGB-D. CuTR is also more robust to the noise in commodity LiDAR-derived depth maps and gives promising RGB-only results without architecture changes. The work argues for a data-centric, image-based approach to scaling indoor 3D perception and introduces a rendering pipeline that aligns world-space annotations with frame-level ground truth, enabling supervision at scale.
Abstract
We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device. We seek to significantly advance the status quo with respect to both data and modeling. First, we establish that existing datasets have significant limitations in scale, accuracy, and diversity of objects. As a result, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish Cubify Transformer (CuTR), a fully Transformer 3D object detection baseline which, rather than operating in 3D on point- or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that, paired with CA-1M, CuTR outperforms point-based methods, accurately recalling over 62% of objects in 3D, and is significantly more capable of handling the noise and uncertainty present in commodity LiDAR-derived depth maps, while also providing promising RGB-only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D, supporting the notion that while inductive biases in 3D are useful at the smaller sizes of existing datasets, they fail to scale to the data-rich regime of CA-1M. Overall, this dataset and baseline model provide strong evidence that we are moving towards models which can effectively Cubify Anything.
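The idea of expressing a scene-level 3D box relative to a single captured frame can be illustrated with a small geometric sketch. This is not the paper's rendering pipeline; it is a minimal example that assumes an axis-aligned box, a known world-to-camera rigid transform, and a pinhole camera. All function names and parameters below are hypothetical, chosen for illustration only.

```python
def world_to_camera(point, rotation, translation):
    """Apply p_cam = R * p_world + t, with R given as a row-major 3x3 list."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def box_corners(center, size):
    """Eight corners of an axis-aligned box (an assumption of this sketch)."""
    cx, cy, cz = center
    hx, hy, hz = (s / 2.0 for s in size)
    return [
        (cx + sx * hx, cy + sy * hy, cz + sz * hz)
        for sx in (-1, 1)
        for sy in (-1, 1)
        for sz in (-1, 1)
    ]


def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection; returns None for points at or behind the camera."""
    x, y, z = point_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def visible_corner_count(center, size, rotation, translation, intrinsics, image_size):
    """Count box corners that project inside the image bounds.

    Occlusion is ignored here; a real pipeline would also need depth or
    mesh information to decide whether a box is actually seen in the frame.
    """
    fx, fy, cx, cy = intrinsics
    width, height = image_size
    count = 0
    for corner in box_corners(center, size):
        uv = project(world_to_camera(corner, rotation, translation), fx, fy, cx, cy)
        if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
            count += 1
    return count
```

A box with no corners inside the frame could be dropped from that frame's ground truth under this simplified rule, while a box with some corners inside would be kept. The criterion actually used for CA-1M is not stated in the abstract.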
