Self-Improving 4D Perception via Self-Distillation

Nan Huang, Pengcheng Yu, Weijia Zeng, James M. Rehg, Angjoo Kanazawa, Haiwen Feng, Qianqian Wang

Abstract

Large-scale multi-view reconstruction models have made remarkable progress, but most existing approaches still rely on fully supervised training with ground-truth 3D/4D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, limiting scalability. We propose SelfEvo, a self-improving framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme using spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external annotations. We systematically study design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g. VGGT and $π^3$), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, without using any labeled data. Project Page: https://self-evo.github.io/.
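
To make the self-distillation scheme concrete, below is a minimal sketch of an EMA teacher-student loop with spatiotemporal context asymmetry, following the high-level description in the abstract and Figure 2 (richer-context teacher, stop-gradient pseudo targets, EMA teacher update). All names here (`make_teacher`, `self_distill_step`, `distill_loss`, and the particular frame-subsampling choice) are illustrative assumptions, not the authors' released implementation.

```python
import copy

import torch


def make_teacher(student):
    # The teacher starts as a copy of the pretrained student and is thereafter
    # updated only through EMA, never through gradients.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher


def ema_update(teacher, student, decay=0.999):
    # Teacher weights track an exponential moving average of the student.
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)


def self_distill_step(student, teacher, frames, optimizer, distill_loss):
    # Context asymmetry: the teacher sees the full clip (rich spatiotemporal
    # context); the student sees only a temporally subsampled subset of it.
    keep = torch.sort(torch.randperm(len(frames))[: len(frames) // 2]).values
    student_frames = [frames[i] for i in keep.tolist()]

    with torch.no_grad():  # stop-gradient pseudo targets from the teacher
        teacher_out = teacher(frames)

    student_out = student(student_frames)

    # Align the student's predictions (e.g. depth / pointmaps / poses) on its
    # input frames with the teacher's richer-context predictions on the same frames.
    loss = distill_loss(student_out, teacher_out, keep)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    ema_update(teacher, student)  # teacher follows the student after each step
    return loss.item()
```

In this sketch, `student` would be the pretrained reconstruction backbone (e.g. VGGT or $π^3$) being post-trained on unlabeled videos, `teacher = make_teacher(student)` is created once before post-training, and `distill_loss` stands in for whatever pseudo-target losses are used; the specific loss signals and forms of asymmetry are exactly the design choices the paper studies.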

Paper Structure

This paper contains 29 sections, 4 equations, 9 figures, 16 tables.

Figures (9)

  • Figure 1: We present SelfEvo, a self-improving framework for learning-based multi-view reconstruction via self-distillation, requiring no ground-truth annotations.
  • Figure 2: We propose an annotation-free self-improving framework that continually post-trains pretrained multi-view reconstruction models using unlabeled videos. Our method forms an online self-distillation loop in which a richer-context teacher provides stop-gradient pseudo targets to a student operating on reduced context; the teacher is then updated as an EMA of the student after each step.
  • Figure 3: Visual results on unseen-domain data, including animal motion, robotics, and egocentric videos.
  • Figure 4: Qualitative results for camera and geometry on in-the-wild videos.
  • Figure 5: Context improves feedforward reconstruction quality. Starting from two temporally distant anchor frames, we progressively add intermediate frames as context and evaluate performance only on the anchors. As the number of input views increases, the overall covisibility increases, while both pointmap and pose errors decrease.
  • ...and 4 more figures