PoolNet: Deep Learning for 2D to 3D Video Process Validation

Sanchit Kaul, Joseph Luna, Shray Arora

TL;DR

The paper tackles the inefficiency of running structure-from-motion on vast, uncurated public data by proposing PoolNet, a CNN-encoder and Transformer-based model that predicts, before reconstruction, whether a frame sequence will yield a usable COLMAP result. It formalizes targets for reconstruction success and geometry, trains on the Redwood RGB-D dataset with carefully crafted label generation, and demonstrates that frame-level embeddings generalize while scene-level predictions achieve high accuracy. The method achieves approximately 80% accuracy on balanced scene data and delivers substantial compute savings, reducing inference time to roughly 17% of the next-fastest baseline, enabling scalable 3D data collection from public sources. These results suggest PoolNet can effectively filter large-scale web-sourced video collections to improve downstream 3D reconstruction quality and efficiency.

Abstract

Lifting Structure-from-Motion (SfM) information from sequential and non-sequential image data is a time-consuming and computationally expensive task. In addition, the majority of publicly available data is unfit for processing due to inadequate camera pose variation, obscuring scene elements, and noisy data. To solve this problem, we introduce PoolNet, a versatile deep learning framework for frame-level and scene-level validation of in-the-wild data. We demonstrate that our model successfully differentiates SfM-ready scenes from those unfit for processing while significantly undercutting the time that state-of-the-art algorithms take to obtain structure-from-motion data.

Paper Structure

This paper contains 16 sections, 1 equation, 1 table.