
BINO: Encoder-Centric Self-Supervised Stereo With Native Pair Input

Haokun Zhou

Abstract

Stereo needs features that preserve fine cross-view correspondence rather than only semantic similarity. Recent self-supervised vision models transfer well, but they are not built for this goal, and geometry-focused methods often rely on a binocular decoder or another explicit linkage module during pretraining. BINO asks whether strong binocular structure can instead be learned inside a compact encoder. It does this by fusing the rectified pair at the input stage, forming stereo micro-cell tokens, and using a row-aware patch-phase positional encoding. Training uses one-view masked token-only distillation together with occlusion and view-specific appearance mismatch. In a strict low-resource setting with pretraining only on KITTI object, BINO gives the best frozen-descriptor results under a no-linkage probe among all compared baselines on proxy dense stereo, hard-negative retrieval, and KITTI Stereo 2012 disparity. With the same lightweight stereo head for every encoder, it stays near CroCo v2 while using a much smaller encoder. Supplementary transfer experiments on KITTI Stereo 2015 show the same qualitative trend. These results suggest that much of the cross-view reasoning often assigned to a separate linkage module can be learned inside a compact and reusable encoder.

Paper Structure

This paper contains 44 sections, 14 equations, 2 figures, 16 tables.

Figures (2)

  • Figure 1: Emergence of epipolar structure across depth in the fused DePos representation. Same-row concentration and ground-truth window mass increase steadily from the embedded input state to the final DePos readout on both controlled synthetic stereo and KITTI Stereo 2012 validation. Dashed lines indicate chance levels. The final DePos readout is marked explicitly to show that applying DePos preserves the already-emergent geometry rather than altering it substantially.
  • Figure 2: Counterfactual evidence for binocular causal structure. Left: retention relative to the original DePos readout under replace-right and row-shuffle-right. Row shuffling preserves same-row concentration but sharply reduces ground-truth-window retention, showing that the learned representation depends on the ordered horizontal structure of the complementary view rather than on a generic row prior. Right: under the duplicate-left counterfactual, zero-disparity metrics rise to several times chance, indicating that the representation shifts coherently toward zero disparity when the two views are made identical.