Back to the Future Cyclopean Stereo: a human perception approach combining deep and geometric constraints

Sherlon Almeida da Silva, Davi Geiger, Luiz Velho, Moacir Antonelli Ponti

TL;DR

Back to the Future Cyclopean Stereo (B2FS) tackles the need for interpretable stereo by coupling a cyclopean geometry, expressed in the XD space, with deep visual features. It introduces two geometric constraints (GC1 and GC2) and a DP-based fusion with monocular priors to fill occluded and textureless regions, further refined by a Fully Convolutional Regression Network guided by a Hybrid Attention Transformer. The method achieves competitive depth accuracy and superior structural detail on Middlebury at 256×256, particularly in depth discontinuities and in low-resolution scenarios. By blending explicit 3D geometry with learning-based cues, B2FS demonstrates a path toward more robust, explainable stereo systems with potential impact on virtual reality, robotics, and autonomous navigation.

Abstract

We innovate in stereo vision by explicitly providing analytical 3D surface models, as viewed by a cyclopean eye model, that incorporate depth discontinuities and occlusions. This geometrical foundation, combined with learned stereo features, allows our system to benefit from the strengths of both approaches. We also invoke a prior monocular model of surfaces to fill in occluded or textureless regions where data matching is not sufficient. Our results are already on par with state-of-the-art purely data-driven methods and are of much better visual quality, emphasizing the importance of the 3D geometrical model in capturing critical visual information. Such qualitative improvements may find applicability in virtual reality, for a better human experience, as well as in robotics, for reducing critical errors. Our approach aims to demonstrate that understanding and modeling geometrical properties of 3D surfaces is beneficial to computer vision research.

Paper Structure

This paper contains 23 sections, 1 theorem, 8 equations, 54 figures.

Key Result

Proposition 2.4

Geometric constraints (GCs) for opaque surfaces.

GC1. The size of the jump, along an epipolar line, of an R- (L-) discontinuity is equal to the size of the L- (R-) occlusion.

GC2. Each cyclopean coordinate $(e,x)$ has one and only one disparity, i.e., $d$ is a function $d:(e,x) \rightarrow \mathbb{R}$.
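
GC1 can be illustrated with a toy example on a single epipolar line. The sketch below is not the paper's implementation: it assumes integer pixel matches that preserve left-to-right ordering, uses the plain disparity $d=r-l$ in pixel units (the paper's cyclopean disparity differs by a scale factor), and treats left pixels with no partner in the right image as the L occlusion. The function name `gaps_and_jumps` and the match list are hypothetical.

```python
def gaps_and_jumps(matches):
    """For consecutive matches on one epipolar line, return
    (unmatched_left, unmatched_right, disparity_jump) triples.

    matches: list of (l, r) integer pixel columns, strictly increasing in
    both l and r (an ordering assumption made for this sketch).
    """
    triples = []
    for (l1, r1), (l2, r2) in zip(matches, matches[1:]):
        unmatched_left = l2 - l1 - 1    # left pixels with no partner (assumed L occlusion)
        unmatched_right = r2 - r1 - 1   # right pixels with no partner (assumed R occlusion)
        jump = (r2 - l2) - (r1 - l1)    # change in d = r - l across the gap
        triples.append((unmatched_left, unmatched_right, jump))
    return triples


# Hypothetical line: left columns 3, 4, 5 have no partner in the right image.
matches = [(0, 0), (1, 1), (2, 2), (6, 3), (7, 4)]
for unmatched_left, unmatched_right, jump in gaps_and_jumps(matches):
    # Algebraic identity: the jump equals the difference of the two gap sizes.
    assert jump == unmatched_right - unmatched_left
print(gaps_and_jumps(matches)[2])  # (3, 0, -3)
```

In this toy line, right columns 2 and 3 are adjacent yet their disparities differ by 3 in magnitude, which is exactly the number of unmatched left columns: a discontinuity seen from the right whose size equals that of the left occlusion.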

Figures (54)

  • Figure 1: Comparison of 256×256 results. We present the left image followed by RAFT-Stereo [lipson2021raft], Selective-IGEV [wang2024selective], and B2FS (ours). In each image, a rectangle is selected and zoomed in (overlaid on the image) to show the visual differences in that area.
  • Figure 2: Space transformation from the L$\times$R CS (left) to the XD space (right). Colors represent disparity values. Empty positions in the XD space are disallowed, while the 'red dot' data are obtained via bilinear interpolation from the LR space data. For each epipolar line, the XD space has twice the resolution of the LR space.
  • Figure 3: ${\cal D}^{L/R}(e,l/r)$ is the depth from the L/R CS, respectively. A point $P$ in 3D is described by the XD as $P=P^{C}(e, x, Z={\cal D}(e,x))$. The same point can be described by the L/R CS as $P=P^{L,R}(e,X_{l,r}, {\cal D}(e,x))$, where $X_{l,r} \ne l,r$, since $l,r$ are the perspective projections of $P$ into the L/R CS, while $X_{l,r}$ is the simpler orthogonal projection of $P$ into the L/R CS. Note that $B=X_l-X_r$. The distances to $P$ measured by the L, R, and cyclopean eyes are ${\cal D}^L(e,l), {\cal D}^R(e,r), {\cal D}(e,x)$, respectively, and they are all different values. Note that the relation $D= f \frac{B}{d}$ assumes $d=r-l$, but our definition of disparity requires a factor of 2.
  • Figure 4: An epipolar slice of a surface with a left occlusion region and its description by the XD. a. A top view of the epipolar slice of the surface and the projections of the two eyes. The baseline ${\bf B}$ connects the L and R focal centers. The depth axis describes the inverse of disparity (Equation \ref{eq:disparity-depth} depicted). b. A discrete XD, a rotation of the L-R CS described by Equation \ref{eq:cyclopean-transformation}. Note that the R CS is pointing down. The two red vertical dashed lines delimit the L occlusion area, which is associated with an R discontinuity along the horizontal blue dashed line with a jump of the same size as the left occlusion, as described by $GC1$ in Proposition \ref{lemma:Geometrical-Constraint}. Note that only two light green squares (the ones without a "y" in them) are seen by the XD in association with the L occlusion, satisfying one disparity per coordinate $x$ (as postulated by $GC2$); this is half the size of the L occlusion area.
  • Figure 5: Workflow. In Step (1), our approach receives an image pair with resolution $r^{H,W,3}$. To obtain full-resolution features, and in anticipation of RAFT-Stereo's resolution reduction, Step (2) upscales the pair to $r^{4H,4W,3}$ using the Hybrid Attention Transformer [chen2023activating]. In Step (3), RAFT-Stereo operates at $\frac{1}{4}$ of the $r^{4H,4W,3}$ resolution, resulting in $r^{H,W,256}$ for ${F^L, F^R}$. Step (4) uses bilinear interpolation to produce twice the width resolution, $r^{H,2W,256}$, and obtain the subpixel data for the XD space (the red dots in Figure \ref{fig:space_transformation}). Steps (5) and (6) perform the feature dot products to create the correlation matrix for each epipolar line at resolution $r^{2W,2W,1}$. Step (7) transfers the correlation data to the XD space, with the maximum disparity considered shown by a solid red horizontal line (removing the need to search beyond that disparity). Step (8) runs DP to obtain a disparity map at coordinates with good data matching, i.e., where a binary data mask indicates non-occluded and non-homogeneous regions. Step (8) also computes monocular depth. Step (9) fills in the surface where DP did not find a solution (homogeneous regions and occlusions) via an FCRN whose input is a normalized monocular depth and whose output is the DP disparity map, only where the data mask indicates that such a solution is available.
  • ...and 49 more figures
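
Steps (4) to (6) of the workflow in Figure 5 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code: random arrays stand in for the RAFT-Stereo features ${F^L, F^R}$, the half-pixel samples are placed at columns $0, 0.5, 1, \dots, W-0.5$ with the last sample repeating the border column, and the function names `upsample_width_2x` and `epipolar_correlation` are hypothetical. Because only the width is resampled, the bilinear interpolation reduces to linear interpolation along each row.

```python
import numpy as np


def upsample_width_2x(feat):
    """Linearly interpolate a (H, W, C) feature map to (H, 2W, C).

    Samples sit at columns 0, 0.5, 1, ..., W - 0.5; the last half-pixel
    sample repeats the final column (a border choice assumed here).
    """
    _, W, _ = feat.shape
    xs = np.arange(2 * W) / 2.0              # target columns in source units
    x0 = np.floor(xs).astype(int)            # left neighbour of each sample
    x1 = np.minimum(x0 + 1, W - 1)           # right neighbour, clamped at the border
    w = (xs - x0)[None, :, None]             # interpolation weight: 0 or 0.5
    return (1.0 - w) * feat[:, x0, :] + w * feat[:, x1, :]


def epipolar_correlation(feat_l, feat_r):
    """Dot-product correlation per epipolar line.

    (H, 2W, C) and (H, 2W, C) -> (H, 2W, 2W), one 2W x 2W matrix per row e.
    """
    return np.einsum("hlc,hrc->hlr", feat_l, feat_r)


rng = np.random.default_rng(0)
H, W, C = 4, 8, 16                           # toy sizes; the caption uses C = 256
F_L = rng.standard_normal((H, W, C))         # placeholder for left features
F_R = rng.standard_normal((H, W, C))         # placeholder for right features

corr = epipolar_correlation(upsample_width_2x(F_L), upsample_width_2x(F_R))
print(corr.shape)  # (4, 16, 16)
```

Each slice `corr[e]` is the $2W \times 2W$ matrix for epipolar line $e$, which Step (7) would then transfer to the XD space; that transfer is omitted here because the transformation equation is not reproduced in this summary.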

Theorems & Definitions (4)

  • Definition 2.1: Opaque Surfaces and Stereo
  • Definition 2.2: Transparent Surfaces and Stereo
  • Definition 2.3: Occlusions and Discontinuities
  • Proposition 2.4