Rooms from Motion: Un-posed Indoor 3D Object Detection as Localization and Mapping

Justin Lazarow, Kai Kang, Afshin Dehghan

TL;DR

Rooms from Motion (RfM) presents an object-centric, un-posed indoor localization and mapping framework that uses metric 3D oriented boxes as the sole geometric primitive. It replaces traditional 2D keypoint or dense volume approaches with a learned 3D box matcher to establish cross-frame correspondences, derives relative poses from matched boxes via 4-DoF alignment, and builds global camera poses and semantic 3D object tracks, optionally refining them with a partial bundle adjustment. The method yields a global, sparse, and semantically rich map whose size scales with the number of objects rather than the scene geometry, and it demonstrates strong localization and map quality on CA-1M and ScanNet++ in both un-posed and pose-assisted settings. These results indicate that object-level representations can underpin accurate 3D localization and mapping, offering a compact alternative to dense point-based or volumetric methods, with particular strengths in RGB-only and un-posed scenarios.
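The 4-DoF alignment mentioned above can be illustrated with a small sketch. It assumes the four degrees of freedom are a rotation about the gravity axis (taken here as z) plus a 3D translation, and that matched box corners are already available as paired metric point sets; the summary does not spell out either detail. The function name `align_4dof` and the synthetic data are hypothetical, not taken from the paper.

```python
import numpy as np

def align_4dof(src, dst):
    """Least-squares yaw + translation that maps src points onto dst points.

    Assumption: both inputs are (N, 3) metric point sets with a shared
    z-up gravity axis, and row i of src corresponds to row i of dst.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    p, q = src - src_mean, dst - dst_mean
    # Closed-form planar Procrustes angle on the horizontal (x, y) components.
    sin_term = np.sum(p[:, 0] * q[:, 1] - p[:, 1] * q[:, 0])
    cos_term = np.sum(p[:, 0] * q[:, 0] + p[:, 1] * q[:, 1])
    yaw = np.arctan2(sin_term, cos_term)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Translation follows from matching the two centroids after rotation.
    t = dst_mean - R @ src_mean
    return R, t, yaw

# Synthetic check: 16 points standing in for the corners of two matched boxes.
rng = np.random.default_rng(0)
corners_a = rng.uniform(-2.0, 2.0, size=(16, 3))
true_yaw, true_t = 0.3, np.array([1.0, -0.5, 0.2])
c, s = np.cos(true_yaw), np.sin(true_yaw)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
corners_b = corners_a @ R_true.T + true_t

R, t, yaw = align_4dof(corners_a, corners_b)
```

Because scale is fixed (the boxes are metric) and roll and pitch are not estimated, the problem reduces to a planar rotation fit plus a centroid offset, which is why a closed form suffices in this sketch. A practical system would presumably wrap such a solver in outlier rejection, which is omitted here.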

Abstract

We revisit scene-level 3D object detection as the output of an object-centric framework capable of both localization and mapping using 3D oriented boxes as the underlying geometric primitive. While existing 3D object detection approaches operate globally and implicitly rely on the a priori existence of metric camera poses, our method, Rooms from Motion (RfM), operates on a collection of un-posed images. By replacing the standard 2D keypoint-based matcher of structure-from-motion with an object-centric matcher based on image-derived 3D boxes, we estimate metric camera poses and object tracks, and finally produce a global, semantic 3D object map. When a priori pose is available, we can significantly improve map quality through optimization of global 3D boxes against individual observations. RfM shows strong localization performance and subsequently produces maps of higher quality than leading point-based and multi-view 3D object detection methods on CA-1M and ScanNet++, despite these global methods relying on overparameterization through point clouds or dense volumes. Rooms from Motion achieves a general, object-centric representation which not only extends the work of Cubify Anything to full scenes but also allows for inherently sparse localization and parametric mapping proportional to the number of objects in a scene.

Paper Structure

This paper contains 35 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Rooms from Motion realizes an object-centric framework for metric localization and semantic 3D object-level mapping from un-posed RGB images without the need for explicit 2D keypoints or point clouds. Given an unordered collection of images, Rooms from Motion detects every object as a metric 3D box within each image, uses a learned object matcher to associate objects across frames, estimates relative poses using the 3D boxes of matched objects, and finally estimates absolute camera poses and forms global, semantic 3D object tracks (akin to 3D object detection). Above: We visualize the semantics-aware map and camera localization of Rooms from Motion from un-posed RGB images on two challenging ScanNet++ scenes: a large laboratory and a few rooms within a residential space. Below: We show class-agnostic results on an open space from the CA-1M dataset.
  • Figure 2: An overview of Rooms from Motion. Two-view geometry is considered first. Semantic objects as 3D boxes are independently detected in individual frames using CuTR (Section \ref{sec:detecting_objects}). Objects are subsequently matched at the 3D box level (Section \ref{sec:matching_objects}) and at the implicit 3D box corner level (Section \ref{sec:match_corners}). Relative pose can be estimated using the matched objects and corners between the images as point sets (Section \ref{sec:relpose_estimation}). Averaging operates on the estimated relative poses to produce global camera poses (Section \ref{sec:averaging}). Given the set of all matched objects, we establish object tracks and use each observation to aid in estimating the global 3D box and semantics for each track. Finally, duplicate tracks are merged or suppressed. Bottom right: We show lifted 3D box observations belonging to each object track by color as well as the final representative 3D boxes (red) for each track.
  • Figure 3: Bundle adjustment can be extended to a global 3D object track. Each observation of a track is characterized by an image and particular 3D box detection from CuTR. Since these individual 3D boxes (shown as thin, white boxes) are usually aligned well with the RGB image, they can be used to enforce a reprojection cost function by comparing the projected corners of these detections to the projected corners of the object track's representative 3D box. The point tracks established in Section \ref{sec:match_corners} (shown in 8 colors) allow us to map the corners of these individual 3D boxes to the corners of the track's global 3D box. In the rightmost pane, we show an initial mismatch of the projected corners from the individual detection (brighter color) and the object track (darker color) that may be minimized by optimizing the track's 3D box parameters.
  • Figure 4: Qualitative comparisons from CA-1M (top half) and ScanNet++ (bottom half) of resulting 3D object maps across different methods as projected to the viewpoint of a FARO scanner. We note that RfM results here are un-posed (i.e., using aligned, estimated poses), while the 3D object detection-based methods rely on a priori pose.
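
The corner reprojection cost described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: it assumes a pinhole camera with intrinsics `K`, a world-to-camera pose `(R_cw, t_cw)`, a box parameterized as (center, size, yaw about z), and corner correspondences that are already resolved (the caption attributes this to the corner point tracks). All function names are hypothetical.

```python
import numpy as np

# Sign pattern for the 8 corners of a box centered at the origin, shape (8, 3).
SIGNS = np.array(
    [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
    dtype=float,
)

def box_corners(center, size, yaw):
    """Return the 8 corners (8, 3) of a box rotated by yaw about the z axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    local = 0.5 * SIGNS * np.asarray(size, dtype=float)
    return local @ R.T + np.asarray(center, dtype=float)

def project(points, R_cw, t_cw, K):
    """Pinhole projection of (N, 3) points to (N, 2) pixel coordinates."""
    cam = points @ R_cw.T + t_cw
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def corner_reprojection_residual(track_box, detection_corners_cam, R_cw, t_cw, K):
    """Pixel residuals between the track's box corners and one detection.

    track_box is (center, size, yaw) in world coordinates; the detection's
    corners are assumed to be given in the camera frame, in matching order.
    """
    track_uv = project(box_corners(*track_box), R_cw, t_cw, K)
    det_uv = project(detection_corners_cam, np.eye(3), np.zeros(3), K)
    return (track_uv - det_uv).ravel()

# Demo with an identity camera pose, so world and camera frames coincide.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R_cw, t_cw = np.eye(3), np.zeros(3)
detection = box_corners([0.0, 0.0, 4.0], [1.0, 1.0, 1.0], 0.1)

res_exact = corner_reprojection_residual(
    ([0.0, 0.0, 4.0], [1.0, 1.0, 1.0], 0.1), detection, R_cw, t_cw, K
)
res_shifted = corner_reprojection_residual(
    ([0.1, 0.0, 4.0], [1.0, 1.0, 1.0], 0.1), detection, R_cw, t_cw, K
)
```

Stacking such residuals over every observation of a track gives a nonlinear least-squares problem in the track's box parameters, which a generic solver such as `scipy.optimize.least_squares` could minimize. Which parameters are held fixed in the paper's partial bundle adjustment is not stated in this summary.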