Poppy: Polarization-based Plug-and-Play Guidance for Enhancing Monocular Normal Estimation

Irene Kim, Sai Tanmay Reddy Chakkera, Alexandros Graikos, Dimitris Samaras, Akshat Dave

Abstract

Monocular surface normal estimators trained on large-scale RGB-normal data often perform poorly on edge cases such as reflective, textureless, and dark surfaces. Polarization encodes surface orientation independently of texture and albedo, offering a physics-based complement for these cases. Existing polarization methods, however, require multi-view capture or specialized training data, limiting generalization. We introduce Poppy, a training-free framework that refines normals from any frozen RGB backbone using single-shot polarization measurements at test time. Keeping backbone weights frozen, Poppy optimizes per-pixel offsets to the input RGB and output normal along with a learned reflectance decomposition. A differentiable rendering layer converts the refined normals into polarization predictions and penalizes mismatches with the observed signal. Across seven benchmarks and three backbone architectures (diffusion, flow, and feed-forward), Poppy reduces mean angular error by 23-26% on synthetic data and 6-16% on real data. These results show that guiding learned RGB-based normal estimators with polarization cues at test time refines normals on challenging surfaces without retraining.

Paper Structure

This paper contains 58 sections, 8 equations, 15 figures, and 5 tables.

Figures (15)

  • Figure 1: Test-time polarization guidance to enhance normal estimation. (a) Polarization-based feed-forward models have limited generalizability due to the scarcity of polarization--normal training pairs. (b) RGB-only monocular normal estimators produce oversmoothed or hallucinated details on challenging surfaces -- normals of the textureless bunny object appear flatter than ground truth. (c) Poppy introduces polarization guidance into pretrained RGB-only models at test time -- improving normal accuracy without retraining.
  • Figure 2: Radiance decomposition. The mixed radiance $S_0$ is decomposed into diffuse radiance $L_d$ and specular radiance $L_s$ from our method. The learned $L_s$ captures specular highlights and environment-dependent reflections (scaled 2$\times$ for clarity), while $L_d$ retains the object's intrinsic diffuse shading and texture. From these radiance components and the predicted normals, the polarization maps (AoLP$\times$DoLP) of the diffuse, specular, and combined components can be obtained.
  • Figure 3: Poppy pipeline. Given polarization measurements, we compute the observed Stokes map $\mathbf{S}$, extract the RGB image $x$, and add learnable image offset $O_x$ to $x$. A frozen backbone produces base normals $\hat{n}_{\text{base}}$; a learnable normal offset $O_n$ yields the refined estimate $\hat{n}_t = \hat{n}_{\text{base}} + O_n$. Using Fresnel equations, the predicted Stokes $\hat{\mathbf{S}}$ is computed from $\hat{n}_t$ and specular radiance $L_s$. We minimize the polarization consistency loss between $\hat{\mathbf{S}}$ and $\mathbf{S}$ to update the image offset $O_x$, normal offset $O_n$, and specular map $L_s$ over $T$ steps, while keeping backbone weights fixed.
  • Figure 4: (a) Stokes reconstruction of hedgehog scene (from NeRSP) when: $L_s=0$ (diffuse only) in column 1; only $L_s$ is learned from backbone predicted normals $\hat{n}_{\text{base}}$ in column 2; $L_s, O_{x},$ and $O_{n}$ are jointly learned in column 3. (b) Jacobian magnitude maps for a selected input pixel, normalized by the 99th percentile for different backbones, showing how perturbations of a single input pixel (green dot) influence the output normal map at a global level, across spatial locations.
  • Figure 5: Ablation of guidance variants on real and synthetic datasets. (a) Mean angular error (MAE) across three backbones (Marigold, Lotus-v2, MoGe-2) with no guidance (None), image offset guidance (Image), and joint image and normal offsets guidance (Joint). Guidance improves normals, with joint guidance performing slightly better on synthetic datasets. (b) MAE over optimization steps on SfPUEL and NeRSP with MoGe-2 backbone. The red-dashed line at $t{=}50$ marks the activation of the normal offset. On synthetic data, joint guidance accelerates error reduction. On real data, sensor noise causes the normal offset to slightly increase MAE, though high-frequency detail is still recovered.
  • ...and 10 more figures
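The captions above describe a differentiable rendering layer that maps refined normals to polarization predictions (AoLP and DoLP maps) via the Fresnel equations. As a point of reference for how normals relate to polarization, below is a minimal NumPy sketch of the standard diffuse Fresnel polarization model from the shape-from-polarization literature, not the paper's exact layer: the degree of linear polarization (DoLP) as a function of the zenith angle between the normal and the viewing direction, and the angle of linear polarization (AoLP), which for diffuse reflection equals the azimuth of the normal in the image plane. The refractive index `n = 1.5` and the function names are illustrative assumptions.

```python
import numpy as np

def diffuse_dolp(theta, n=1.5):
    """Degree of linear polarization for diffuse reflection
    (standard Fresnel-based model; n is the assumed refractive index).
    theta: zenith angle between surface normal and view direction, in radians.
    """
    s2 = np.sin(theta) ** 2
    num = (n - 1.0 / n) ** 2 * s2
    den = (2.0 + 2.0 * n ** 2
           - (n + 1.0 / n) ** 2 * s2
           + 4.0 * np.cos(theta) * np.sqrt(n ** 2 - s2))
    return num / den

def diffuse_aolp(normal):
    """Angle of linear polarization for diffuse reflection:
    the azimuth of the surface normal projected onto the image plane.
    normal: array whose last axis holds (nx, ny, nz)."""
    nx, ny = normal[..., 0], normal[..., 1]
    return np.arctan2(ny, nx) % np.pi  # AoLP is pi-periodic

# A fronto-parallel surface patch (theta = 0) is unpolarized;
# diffuse polarization grows monotonically toward grazing angles.
print(diffuse_dolp(0.0))               # 0.0
print(diffuse_dolp(np.deg2rad(80.0)))  # ~0.25 for n = 1.5
```

This one-to-one dependence of DoLP/AoLP on surface orientation is what makes the polarization consistency loss in Figure 3 informative: a mismatch between predicted and observed Stokes maps directly signals an error in the refined normals, even on textureless or dark surfaces where RGB cues are weak.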