Revisiting Shape from Polarization in the Era of Vision Foundation Models

Chenhao Li, Taishi Ono, Takeshi Uemori, Yusuke Moriuchi

TL;DR

With polarization cues, a lightweight model trained on a small dataset can outperform RGB-only vision foundation models (VFMs) in single-shot object-level surface normal estimation. Pretrained DINOv3 priors are incorporated to improve generalization to unseen objects.

Abstract

We show that, with polarization cues, a lightweight model trained on a small dataset can outperform RGB-only vision foundation models (VFMs) in single-shot object-level surface normal estimation. Shape from polarization (SfP) has long been studied due to the strong physical relationship between polarization and surface geometry. Meanwhile, driven by scaling laws, RGB-only VFMs trained on large datasets have recently achieved impressive performance and surpassed existing SfP methods. This situation raises questions about the necessity of polarization cues, which require specialized hardware and have limited training data. We argue that the weaker performance of prior SfP methods does not come from the polarization modality itself, but from domain gaps. These domain gaps mainly arise from two sources. First, existing synthetic datasets use limited and unrealistic 3D objects, with simple geometry and random texture maps that do not match the underlying shapes. Second, real-world polarization signals are often affected by sensor noise, which is not well modeled during training. To address the first issue, we render a high-quality polarization dataset using 1,954 3D-scanned real-world objects. We further incorporate pretrained DINOv3 priors to improve generalization to unseen objects. To address the second issue, we introduce polarization sensor-aware data augmentation that better reflects real-world conditions. With only 40K training scenes, our method significantly outperforms both state-of-the-art SfP approaches and RGB-only VFMs. Extensive experiments show that polarization cues enable a 33x reduction in training data or an 8x reduction in model parameters, while still achieving better performance than RGB-only counterparts.
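The second domain gap named in the abstract, sensor noise in real polarization signals, can be illustrated with a minimal sketch. It assumes a four-angle polarization sensor (0°, 45°, 90°, 135°), additive Gaussian noise, and the standard Stokes-parameter formulas for the degree and angle of linear polarization (DoLP, AoLP). The function names, the noise model, and all parameter values are illustrative assumptions and are not the authors' augmentation pipeline. The point is only that noise injected into the raw intensities, before the polarization quantities are computed, propagates to AoLP in a signal-dependent way.

```python
import numpy as np

ANGLES = np.deg2rad([0.0, 45.0, 90.0, 135.0])  # assumed polarizer angles


def synth_intensities(s0, dolp, aolp):
    """Ideal intensities behind each polarizer: I = 0.5*S0*(1 + DoLP*cos(2*(theta - AoLP)))."""
    return [0.5 * s0 * (1.0 + dolp * np.cos(2.0 * (a - aolp))) for a in ANGLES]


def add_sensor_noise(intensities, sigma, rng):
    """Hypothetical noise model: additive Gaussian noise on the raw intensities, clipped at 0."""
    return [np.clip(i + rng.normal(0.0, sigma, np.shape(i)), 0.0, None) for i in intensities]


def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from the four polarizer measurements."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)
    s1 = i0 - i90
    s2 = i45 - i135
    return s0, s1, s2


def aolp_dolp(s0, s1, s2, eps=1e-8):
    """AoLP in (-pi/2, pi/2] and DoLP from the linear Stokes parameters."""
    dolp = np.sqrt(s1 ** 2 + s2 ** 2) / np.maximum(s0, eps)
    aolp = 0.5 * np.arctan2(s2, s1)
    return aolp, dolp


def aolp_error(a, b):
    """Absolute AoLP difference, accounting for the pi-periodicity of the angle."""
    return np.abs((a - b + np.pi / 2.0) % np.pi - np.pi / 2.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 10_000
    true_aolp = np.full(n, 0.3)
    for dolp_value in (0.05, 0.5):
        clean = synth_intensities(np.ones(n), np.full(n, dolp_value), true_aolp)
        noisy = add_sensor_noise(clean, sigma=0.01, rng=rng)
        est_aolp, _ = aolp_dolp(*stokes_from_intensities(*noisy))
        print(dolp_value, aolp_error(est_aolp, true_aolp).mean())
```

In this toy setting the same intensity noise produces a much larger AoLP error where DoLP is low, which is one reason adding noise directly to a rendered AoLP map would not mimic a real sensor.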

Paper Structure (21 sections, 3 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Our method surpasses the previous best SfP approach (SfPUEL [lyu2024sfpuel]), a leading discriminative VFM (MoGe2 [wang2025moge2]), a generative VFM (StableNormal [ye2024stablenormal]), and a commercial inverse rendering tool (SwitchLight3 [beeble_switchlight3]). The benefit of polarization cues is evident from the comparison with our RGB-only ablation. The numbers shown below each method indicate frames per second (FPS) and mean angular error (MAE). Inference speed for all models is measured on a V100 GPU at a resolution of 512 × 612 with FP16 precision.
  • Figure 2: Data augmentation pipeline and model architecture.
  • Figure 3: Visualization of a plastic ball in real and synthetic data with noise simulation. In real-world measurements, the angle of linear polarization (AoLP) is consistently noisy due to sensor and acquisition artifacts. In contrast, rendered AoLP appears overly clean because of the idealized sensor model. Directly injecting noise into RGB or AoLP is not realistic (the noise level is amplified here for visualization). Instead, applying augmentation before polarization signal processing better matches real noise characteristics: the RGB domain is less affected, and AoLP noise is concentrated in regions with rapid AoLP direction changes.
  • Figure 4: Qualitative comparisons on Our real w/ GT.
  • Figure 5: Qualitative comparisons on Our real w/o GT (Please zoom in for details).
  • ...and 3 more figures
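
The mean angular error (MAE) reported in Figure 1 can be sketched as follows. This is a minimal illustration of the usual definition of the metric (the per-pixel angle between predicted and ground-truth unit normals, averaged over valid pixels); the function name, the optional `mask` argument, and the choice of degrees are assumptions, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def mean_angular_error_deg(pred, gt, mask=None, eps=1e-8):
    """Mean angle in degrees between two normal maps of shape (..., 3).

    `mask` is an optional boolean array selecting valid (object) pixels.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Normalize both maps so the dot product equals the cosine of the angle.
    pred = pred / np.maximum(np.linalg.norm(pred, axis=-1, keepdims=True), eps)
    gt = gt / np.maximum(np.linalg.norm(gt, axis=-1, keepdims=True), eps)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    if mask is not None:
        err = err[np.asarray(mask, dtype=bool)]
    return float(err.mean())
```

Restricting the average to a mask matters for object-level evaluation, since background pixels have no meaningful normal.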