RePer-360: Releasing Perspective Priors for 360$^\circ$ Depth Estimation via Self-Modulation

Cheng Guan, Chunyu Lin, Zhijie Shen, Junsong Zhang, Jiyuan Wang

TL;DR

We propose RePer-360, a distortion-aware self-modulation framework for monocular panoramic depth estimation that adapts depth foundation models while preserving their powerful pretrained perspective priors. It surpasses standard fine-tuning methods while using only 1% of the training data.

Abstract

Recent depth foundation models trained on perspective imagery achieve strong performance, yet generalize poorly to 360$^\circ$ images due to the substantial geometric discrepancy between perspective and panoramic domains. Moreover, fully fine-tuning these models typically requires large amounts of panoramic data. To address these issues, we propose RePer-360, a distortion-aware self-modulation framework for monocular panoramic depth estimation that adapts depth foundation models while preserving powerful pretrained perspective priors. Specifically, we design a lightweight geometry-aligned guidance module to derive a modulation signal from two complementary projections, i.e., equirectangular projection (ERP) and cubemap projection (CP), and use it to guide the model toward the panoramic domain without overwriting its pretrained perspective knowledge. We further introduce a Self-Conditioned AdaLN-Zero mechanism that produces pixel-wise scaling factors to reduce the feature distribution gap between the perspective and panoramic domains. In addition, a cubemap-domain consistency loss further improves training stability and cross-projection alignment. By shifting the focus from complementary-projection fusion to panoramic domain adaptation under preserved pretrained perspective priors, RePer-360 surpasses standard fine-tuning methods while using only 1\% of the training data. Under the same in-domain training setting, it further achieves an approximately 20\% improvement in RMSE. Code will be released upon acceptance.
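
The abstract does not give the exact form of the Self-Conditioned AdaLN-Zero mechanism, so the following is only a minimal sketch of the general idea behind zero-initialized, pixel-wise scale modulation. It assumes that each pixel (token) has a guidance vector, that a zero-initialized linear projection maps it to one scale per feature channel, and that features are modulated as `feature * (1 + scale)`. All function names, shapes, and values are hypothetical and chosen for illustration; they are not taken from the paper.

```python
# Illustrative sketch only: names, shapes, and the (1 + scale) form are
# assumptions, not the authors' implementation.

def zero_init_projection(guidance_dim, feature_dim):
    """Return all-zero weights and bias, so modulation starts as a no-op."""
    weights = [[0.0] * guidance_dim for _ in range(feature_dim)]
    bias = [0.0] * feature_dim
    return weights, bias


def pixelwise_scale(guidance, weights, bias):
    """Map one pixel's guidance vector to one scale per feature channel."""
    return [
        sum(w * g for w, g in zip(row, guidance)) + b
        for row, b in zip(weights, bias)
    ]


def modulate(features, guidance, weights, bias):
    """Apply feature * (1 + scale) independently at every pixel."""
    out = []
    for feat, guide in zip(features, guidance):
        scale = pixelwise_scale(guide, weights, bias)
        out.append([f * (1.0 + s) for f, s in zip(feat, scale)])
    return out


# Two pixels, three feature channels, two guidance channels (toy values).
features = [[1.0, -2.0, 0.5], [2.0, 3.0, -1.0]]
guidance = [[0.2, 0.8], [1.0, -1.0]]

# With zero-initialized parameters the features pass through unchanged.
weights, bias = zero_init_projection(guidance_dim=2, feature_dim=3)
unchanged = modulate(features, guidance, weights, bias)

# After a hypothetical update to a single weight, only channel 0 changes,
# and the change differs per pixel because the guidance differs per pixel.
weights[0][0] = 0.5
adapted = modulate(features, guidance, weights, bias)
```

Under these assumptions, zero initialization means the adapted model starts out identical to the pretrained one, which is one way to read "without overwriting its pretrained perspective knowledge"; the modulation only departs from identity as the projection is trained.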

Paper Structure (14 sections, 11 equations, 8 figures, 4 tables)

This paper contains 14 sections, 11 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Technical strategies based on pretrained perspective models (Persp): (Top) Patch fusion suffers from artifacts; (Middle) End-to-end fine-tuning requires massive data; (Bottom) Our RePer-360 enables precise detail transfer with minimal data.
  • Figure 2: Overview of the proposed framework. Our framework takes a single panorama as input and outputs the corresponding depth map. The network leverages the rich visual priors in the DAM to make it suitable for panoramic depth estimation.
  • Figure 3: Geometry-Aligned Guidance (GAG). The left heatmap visualizes spatially adaptive gating between CP and ERP features. The right schematic illustrates the guidance construction process used to generate modulation inputs for SCAdaLN-Zero.
  • Figure 4: The top-left panel illustrates the standard DiT block. The bottom-left panel shows our SCAdaLN-Zero extension embedded in the frozen backbone. The right side details the flow: it utilizes an internal geometry-derived signal from GAG as a self-conditioning signal to modulate backbone features.
  • Figure 5: Qualitative comparison between the current SOTA method PanDA-L [cao2025panda] and ours. The results of PanDA-L are obtained using their released weights. The top sample is from Matterport3D, while the bottom one is from Stanford2D3D.
  • ...and 3 more figures
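
The Figure 3 caption mentions spatially adaptive gating between CP and ERP features without specifying its form. One common way to realize such gating is a per-pixel convex blend controlled by a sigmoid gate, sketched below. This is an assumption for illustration, not the paper's GAG module: the function names are hypothetical, and the sketch presumes the CP features have already been re-projected onto the ERP grid so that the two inputs are aligned pixel by pixel.

```python
import math

# Illustrative sketch only: a sigmoid-gated blend is assumed here and may
# differ from the gating actually used in the paper.

def sigmoid(x):
    """Logistic function mapping a real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(erp_feat, cp_feat, gate_logits):
    """Blend per pixel: gate * ERP + (1 - gate) * CP."""
    fused = []
    for erp, cp, logit in zip(erp_feat, cp_feat, gate_logits):
        gate = sigmoid(logit)
        fused.append([gate * e + (1.0 - gate) * c for e, c in zip(erp, cp)])
    return fused


# Two pixels with two channels each (toy values).
erp_feat = [[1.0, 1.0], [1.0, 1.0]]
cp_feat = [[0.0, 0.0], [0.0, 0.0]]
gate_logits = [0.0, 4.0]  # pixel 0: equal blend; pixel 1: mostly ERP

fused = gated_fusion(erp_feat, cp_feat, gate_logits)
```

A heatmap of the per-pixel gate values from such a blend would show where each projection dominates, which is the kind of visualization the caption describes.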