What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?

David Yan, Alexander Raistrick, Jia Deng

TL;DR

The paper asks what makes synthetic data effective for zero-shot stereo matching. Using a procedural data generator built on Infinigen, it runs a parameter study and identifies factors that influence zero-shot performance: floating-object density, background realism, material diversity, lighting, and baseline variation. The authors then construct WMGStereo-150k (163,666 pairs) from the best settings. Training on it alone outperforms training on a mixture of widely used datasets and is competitive with training on the FoundationStereo dataset, while also being more sample-efficient than SceneFlow and CREStereo. The generation code is open-sourced and, together with the parameter analysis, gives a practical framework for designing future synthetic stereo datasets.

Abstract

Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We validate our findings by collecting the best settings and creating a large-scale dataset. Training only on this dataset achieves better performance than training on a mixture of widely used datasets, and is competitive with training on the FoundationStereo dataset, with the additional benefit of open-source generation code and an accompanying parameter analysis to enable further research. We open-source our system at https://github.com/princeton-vl/InfinigenStereo to enable further research on procedural stereo datasets.

Paper Structure

This paper contains 21 sections, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Visualization of synthetic dataset design choices. For each studied parameter, we show a single illustrative example demonstrating the effect of changing dataset generation parameters such as object density, background realism, object types, lighting, and materials. Each example is sampled from a dataset evaluated in Tab. \ref{tab:parameter_analysis_quantitative}.
  • Figure 2: WMGStereo-150k Dataset. From left to right, we show random, non-cherrypicked samples from our Indoors with Floating Objects, Dense Floating Objects, and Nature scene types. Refer to the Supplement for additional samples.
  • Figure 3: End-point-error averaged by object and material. We show average error only for the materials that make up at least 0.1% of pixels. Assets marked in red were removed from our system after manual inspection because they introduce ambiguous high-error cases, such as entirely transparent or reflective surfaces, or thin imperceptible foliage / holes.
  • Figure 4: Qualitative comparison of in-the-wild predictions. We train DLNR on WMGStereo-150k and existing synthetic datasets. Training on our dataset achieves superior predictions on textureless regions (top row; blank ceiling), nature details (middle row; background leaves), and non-Lambertian surfaces (bottom row; TV screen). Images are from InStereo2k [bao2020instereo2k], Flickr1024 [wang2019flickr1024], and Booster [zamaramirez2024booster].
  • Figure 5: Zero-shot performance by dataset size. Our dataset is more sample-efficient than SceneFlow and CREStereo, achieving better performance across a range of dataset sizes, and is competitive with FSD (the FoundationStereo dataset).
  • ...and 4 more figures