LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning

Rui Li, Biao Zhang, Zhenyu Li, Federico Tombari, Peter Wonka

TL;DR

LaRI introduces layered ray intersections to reason about unseen geometry from a single image, encoding all ray-surface intersections as a fixed-size layered point map, with a ray stopping index that identifies the valid layers. The approach regresses the layered geometry and the stopping index with a ViT-based encoder-decoder and relies on a dedicated data-construction pipeline, enabling view-aligned, complete 3D reasoning that unifies object- and scene-level tasks. Empirically, LaRI achieves competitive object-level results with only a fraction of the training data and parameters used by large generative models, while reasoning about unseen scene-level geometry in a single forward pass. Limitations include lower point density on surfaces parallel to the camera rays and dataset limitations, but the method offers a scalable, efficient framework for single-view geometric reasoning about unseen structures.

Abstract

We present layered ray intersections (LaRI), a new method for unseen geometry reasoning from a single image. Unlike conventional depth estimation, which is limited to the visible surface, LaRI models multiple surfaces intersected by the camera rays using layered point maps. Benefiting from the compact and layered representation, LaRI enables complete, efficient, and view-aligned geometric reasoning to unify object- and scene-level tasks. We further propose to predict the ray stopping index, which identifies valid intersecting pixels and layers from LaRI's output. We build a complete training data generation pipeline for synthetic and real-world data, including 3D objects and scenes, with necessary data cleaning steps and coordination between rendering engines. As a generic method, LaRI's performance is validated in two scenarios: it yields comparable object-level results to the recent large generative model using 4% of its training data and 17% of its parameters. Meanwhile, it achieves scene-level occluded geometry reasoning in a single feed-forward pass.

Paper Structure

This paper contains 15 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Layered ray intersection (LaRI) models multiple 3D surfaces from a single view by representing ray-surface intersections as depth-ordered, layered 3D point maps (colors denote different layers: 1, 2, 3, 4, $\dots$). This enables us to unify the object- and scene-level geometric reasoning tasks, with simultaneous support for complete, efficient, and view-aligned modeling.
  • Figure 2: Overview. Left: Given an image captured by a camera, the 3D geometry can be represented by the intersections between the camera ray and object surfaces. While conventional depth estimation methods only model the first intersection (i.e., layer 1), LaRI represents all intersections (e.g., layer 1, 2, 3, $\dots$) with layered 3D point maps (we use depth here only for visualization purposes), allowing for reasoning about unseen structures from visible cues. Right: We conduct single-view 3D geometric reasoning by formulating it as a standard 2D regression task. Given an input image $\mathbf{I}$, the model predicts the LaRI map $\mathbf{V} \in \mathbb{R}^{H\times W \times L \times 3}$, which represents all possible intersection coordinates with a fixed layer number. It further identifies the valid ray intersections in the LaRI map by regressing the ray stopping index $\mathbf{C} \in \mathbb{R}^{H\times W \times L}$, which is transformed into binary masks $\mathbf{M}$ to derive the final point cloud $\hat{\mathbf{V}}$.
  • Figure 3: Qualitative comparisons on GSO [downs2022google]. All methods are evaluated with view-aligned GT. Our method yields visually more plausible results than existing methods [huang2025spar3d, boss2024sf3d] trained similarly with Objaverse. Meanwhile, our method estimates 3D structures more faithfully to the input image, compared to the large generative model.
  • Figure 4: Qualitative comparisons on SCRREAM [downs2022google]. Compared to methods focusing only on visible surface reconstruction, our method significantly extends the modeling coverage by reasoning about unseen regions (i.e., the colored regions) with different layers organized in a depth-ordered manner. For instance, the near unseen regions (e.g., the self-occluded bed and sofa) are captured by layer 2 and the farther regions (e.g., the floors and wall) are captured by layer 3.
  • Figure 5: Limitations. LaRI yields lower point density on surfaces parallel to the camera ray and in areas in-between layers. As a deterministic approach, our current method might fail to infer a plausible shape when given a limited observation, e.g., under heavy occlusion.
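
The masking step described in the Figure 2 caption (layered map $\mathbf{V}$, binary masks $\mathbf{M}$, final point cloud $\hat{\mathbf{V}}$) can be illustrated with a minimal NumPy sketch. The captions do not say how the regressed $\mathbf{C}$ is converted into $\mathbf{M}$, so the sketch starts from an assumed intermediate: one integer per pixel giving the number of valid layers along that ray (0 meaning the ray hits nothing). The function names, the `stop_index` array, and this integer convention are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
import numpy as np


def masks_from_stop_index(stop_index: np.ndarray, num_layers: int) -> np.ndarray:
    """Return a boolean mask M of shape (H, W, L).

    Assumption (not from the paper): stop_index[h, w] is the number of valid
    layers along the ray through pixel (h, w), so 0 means the ray hits nothing.
    """
    layer_ids = np.arange(num_layers)  # shape (L,)
    # Layer l is valid for a pixel when l < stop_index at that pixel.
    return layer_ids[None, None, :] < stop_index[..., None]


def point_cloud_from_lari(lari_map: np.ndarray, stop_index: np.ndarray) -> np.ndarray:
    """Select the valid 3D points from a layered point map V of shape (H, W, L, 3)."""
    num_layers = lari_map.shape[2]
    mask = masks_from_stop_index(stop_index, num_layers)
    # Boolean indexing over the first three axes yields an (N, 3) array.
    return lari_map[mask]


# Toy example: a 2x2 image with L = 3 layers of random coordinates.
rng = np.random.default_rng(0)
V = rng.normal(size=(2, 2, 3, 3))
stop = np.array([[0, 1], [2, 3]])
points = point_cloud_from_lari(V, stop)
print(points.shape)  # (6, 3), since 0 + 1 + 2 + 3 layers are valid
```

Because the layered map has a fixed size, the point count varies per image only through the mask, which is why the selection reduces to a single boolean-indexing step here.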