Rooms from Motion: Un-posed Indoor 3D Object Detection as Localization and Mapping
Justin Lazarow, Kai Kang, Afshin Dehghan
TL;DR
Rooms from Motion (RfM) is an object-centric framework for un-posed indoor localization and mapping that uses metric 3D oriented boxes as its sole geometric primitive. In place of the 2D keypoint matcher of structure-from-motion, it uses a learned 3D box matcher to establish cross-frame correspondences, derives relative poses from matched boxes via 4-DoF alignment, and builds global camera poses and semantic 3D object tracks, optionally refining them with a partial bundle adjustment. The result is a global, sparse, and semantically rich map whose size scales with the number of objects rather than with the amount of scene geometry. RfM demonstrates strong localization and map quality on CA-1M and ScanNet++ in both un-posed and pose-assisted settings. These results indicate that object-level representations can underpin accurate 3D localization and mapping, offering a compact alternative to dense point-based or volumetric methods, with particular strengths in RGB-only and un-posed scenarios.
Abstract
We revisit scene-level 3D object detection as the output of an object-centric framework capable of both localization and mapping using 3D oriented boxes as the underlying geometric primitive. While existing 3D object detection approaches operate globally and implicitly rely on the a priori existence of metric camera poses, our method, Rooms from Motion (RfM), operates on a collection of un-posed images. By replacing the standard 2D keypoint-based matcher of structure-from-motion with an object-centric matcher based on image-derived 3D boxes, we estimate metric camera poses and object tracks, and finally produce a global, semantic 3D object map. When a priori pose is available, we can significantly improve map quality through optimization of global 3D boxes against individual observations. RfM shows strong localization performance and subsequently produces maps of higher quality than leading point-based and multi-view 3D object detection methods on CA-1M and ScanNet++, despite these global methods relying on overparameterization through point clouds or dense volumes. Rooms from Motion achieves a general, object-centric representation which not only extends the work of Cubify Anything to full scenes but also allows for inherently sparse localization and parametric mapping proportional to the number of objects in a scene.
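To make the TL;DR's "4-DoF alignment" of matched boxes concrete, the following is a minimal sketch of one way such an alignment could be computed. It is not the authors' implementation. It assumes that both frames are gravity-aligned with gravity along the z axis, that only the matched box centers are used (box yaw and extents are ignored), and that at least two matches with distinct horizontal positions are available. The function name `align_4dof` is hypothetical. Under these assumptions, the least-squares yaw and translation have a closed form.

```python
import math


def align_4dof(src, dst):
    """Estimate yaw (about z) and translation mapping src centers onto dst.

    src, dst: equal-length lists of (x, y, z) matched box centers.
    Returns (yaw_radians, (tx, ty, tz)) minimizing the squared error of
    R_z(yaw) @ src_i + t - dst_i.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched centers")
    n = len(src)
    # Centroids of both point sets.
    cs = [sum(p[k] for p in src) / n for k in range(3)]
    cd = [sum(q[k] for q in dst) / n for k in range(3)]
    # Accumulate 2D dot and cross terms of the centered xy coordinates.
    s_dot = 0.0
    s_cross = 0.0
    for p, q in zip(src, dst):
        px, py = p[0] - cs[0], p[1] - cs[1]
        qx, qy = q[0] - cd[0], q[1] - cd[1]
        s_dot += px * qx + py * qy
        s_cross += px * qy - py * qx
    # The yaw maximizing cos(yaw) * s_dot + sin(yaw) * s_cross.
    yaw = math.atan2(s_cross, s_dot)
    c, s = math.cos(yaw), math.sin(yaw)
    # Translation carries the rotated source centroid onto the target centroid.
    t = (
        cd[0] - (c * cs[0] - s * cs[1]),
        cd[1] - (s * cs[0] + c * cs[1]),
        cd[2] - cs[2],
    )
    return yaw, t
```

Restricting rotation to yaw reduces the problem from six to four unknowns, which is why a small number of matched objects can suffice. A practical system would likely add outlier rejection over the matches, which this sketch omits.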
