UniPlane: Unified Plane Detection and Reconstruction from Posed Monocular Videos

Yuzhong Huang; Chen Liu; Ji Hou; Ke Huo; Shiyu Dong; Fred Morstatter

TL;DR

UniPlane reconstructs structured 3D scenes from posed monocular video by jointly detecting and reconstructing planar surfaces within a single Transformer-based framework. It builds a sparse 3D feature volume from posed image sequences and uses per-plane embeddings as queries: plane instances are inferred from dot products between voxel embeddings and plane embeddings, then refined by differentiable rendering to align with the input frames. Because plane embeddings are propagated across video fragments, no separate tracking/fusion module is needed, and geometry and segmentation can be optimized end to end. On ScanNetv2, UniPlane improves over state-of-the-art baselines in both 3D geometry and segmentation metrics, including a +4.6 gain in geometry F-score, reflecting better plane detection accuracy and reconstruction quality. The framework is compact and end-to-end, suited to large-scale plane-based scene understanding from monocular video, with potential extension to other geometric primitives such as boxes, spheres, and NURBS surfaces.
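One way to picture the propagation of plane embeddings across fragments is as query construction: the decoder queries for a fragment combine plane embeddings carried over from the previous fragment with a fixed set of learnable queries. The sketch below is only an illustration under assumptions, not the authors' code. The function name `build_fragment_queries`, the ordering (carried-over embeddings first), and the use of integer plane IDs are all hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def build_fragment_queries(
    learnable_queries: List[Vector],
    tracked: List[Tuple[int, Vector]],
) -> Tuple[List[Vector], List[Optional[int]]]:
    """Assemble decoder queries for one fragment (hypothetical helper).

    tracked holds (plane_id, embedding) pairs from the previous fragment.
    Carried-over embeddings come first so each keeps its plane ID; the
    learnable queries follow with ID None, meaning "candidate new plane".
    """
    queries = [embedding for _, embedding in tracked] + list(learnable_queries)
    plane_ids: List[Optional[int]] = [pid for pid, _ in tracked]
    plane_ids += [None] * len(learnable_queries)
    return queries, plane_ids


# Toy data (hypothetical): one plane tracked from the previous fragment,
# two learnable queries available for newly observed planes.
learnable = [[0.0, 0.0], [0.5, 0.5]]
tracked = [(7, [1.0, -1.0])]

queries, plane_ids = build_fragment_queries(learnable, tracked)
```

Under this assumed layout, a plane predicted from a carried-over query inherits the ID of the plane it came from, so no separate association step is needed between fragments.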

Abstract

We present UniPlane, a novel method that unifies plane detection and reconstruction from posed monocular videos. Unlike existing methods that detect planes from local observations and associate them across the video for the final reconstruction, UniPlane unifies both the detection and the reconstruction tasks in a single network, which allows us to directly optimize final reconstruction quality and fully leverage temporal information. Specifically, we build a Transformer-based deep neural network that jointly constructs a 3D feature volume for the environment and estimates a set of per-plane embeddings as queries. UniPlane directly reconstructs the 3D planes by taking dot products between voxel embeddings and the plane embeddings, followed by binary thresholding. Extensive experiments on real-world datasets demonstrate that UniPlane outperforms state-of-the-art methods in both plane detection and reconstruction tasks, achieving a +4.6 gain in geometry F-score as well as consistent improvements in other geometry and segmentation metrics.
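The mask-prediction step described in the abstract (dot products between voxel embeddings and plane embeddings, followed by binary thresholding) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `plane_masks`, the threshold value of 0.0, and the toy embeddings are assumptions made for the example, and a real system would operate on sparse tensors rather than Python lists.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions must match")
    return sum(x * y for x, y in zip(a, b))


def plane_masks(
    voxel_embeddings: List[Vector],
    plane_embeddings: List[Vector],
    threshold: float = 0.0,  # assumed cutoff; the source does not state a value
) -> List[List[bool]]:
    """Return one binary voxel mask per plane embedding.

    masks[p][v] is True when the dot product between plane p's embedding
    and voxel v's embedding is strictly greater than the threshold.
    """
    return [
        [dot(plane, voxel) > threshold for voxel in voxel_embeddings]
        for plane in plane_embeddings
    ]


# Toy data (hypothetical): four voxels and two plane queries
# in a 2-D embedding space.
voxels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, -1.0]]
planes = [[1.0, -1.0], [-1.0, 1.0]]

masks = plane_masks(voxels, planes)
# masks[0] selects the first two voxels, masks[1] selects the third;
# the fourth voxel scores exactly 0 for both planes and is left unassigned.
```

In this toy setup each plane query acts as a linear classifier over voxel embeddings, which is why a single network can produce detection (which queries fire) and reconstruction (which voxels they claim) at once.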

Paper Structure (20 sections, 3 equations, 5 figures, 4 tables)

Introduction
Related work
Single-view plane reconstruction
Multi-view plane reconstruction
Learning-based tracking and reconstruction
Methods
Feature volume construction
Transformers-based Plane Detection
Unifying Tracking with Reconstruction
Refine Planes using Differentiable Rendering
Experiments
Implementation Details
Setup
3D Geometric Metric
3D Segmentation Metric
...and 5 more sections

Figures (5)

Figure 1: Comparison between UniPlane and PlanarRecon. Left: predictions from our baseline, PlanarRecon. Middle: reconstructions from UniPlane. Right: ground-truth plane reconstruction. Each color represents a plane instance. Textured planes are learned with a rendering loss. Our model accurately detects more planes, improving both recall and precision.
Figure 2: 2D toy example to illustrate view consistency. Each 2D pixel will project visual features onto voxels accessible by a ray from it. Voxels receiving consistent visual features are occupied, marked by color on the right. Voxels receiving different visual features are unoccupied, and marked as white on the right. Voxels behind occupied voxels are occluded, and marked as gray on the right.
Figure 3: Overall architecture. A sequence of posed images is organized into fragments, and a voxel feature grid is constructed for each fragment. The per-voxel features are used both to make per-voxel predictions and as the key and value vectors of a transformer decoder. The queries to the transformer consist of a sequence of learnable query vectors together with the plane embeddings from the previous fragment, which allows planes to be tracked across fragments.
Figure 4: Refine planes using differentiable rendering
Figure 5: Qualitative results for plane detection on ScanNet. Our method outperforms PlanarRecon, reconstructing a more complete scene and retaining more detail. Different colors indicate the segmentation of different surfaces, from which we can observe that UniPlane achieves much better precision and recall. The spectrum coloring indicates surface normals. Compared to PlanarRecon, UniPlane yields much lower normal errors.
