Improving 2D Feature Representations by 3D-Aware Fine-Tuning

Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, Jan Eric Lenssen

TL;DR

The paper addresses the limited 3D understanding of 2D vision foundation models with a two-stage approach: it first lifts 2D features into a 3D Gaussian representation, then fine-tunes the 2D backbone on the rendered 3D-aware features. With simple linear probing, the fine-tuned features improve downstream semantic segmentation and depth estimation, and the gains transfer from the indoor scans used for fine-tuning to other indoor datasets, out-of-domain (including outdoor) datasets, and other vision-model families. The key contributions are a memory-efficient 3D Gaussian feature representation, an efficient rendering and up-projection pipeline, and a fine-tuning regime that preserves 2D generalization while embedding 3D awareness.

Abstract

Current visual foundation models are trained purely on unstructured 2D data, limiting their understanding of the 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of emerging semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them for arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy to transfer such 3D awareness into a 2D foundation model. We demonstrate that models fine-tuned in that way produce features that readily improve downstream task performance in semantic segmentation and depth estimation through simple linear probing. Notably, though fine-tuned on a single indoor dataset, the improvement is transferable to a variety of indoor datasets and out-of-domain datasets. We hope our study encourages the community to consider injecting 3D awareness when training 2D foundation models. Project page: https://ywyue.github.io/FiT3D.

Paper Structure

This paper contains 22 sections, 3 equations, 11 figures, 13 tables, and 1 algorithm.

Figures (11)

  • Figure 1: We propose 3D-aware fine-tuning to improve 2D foundation features. Our method starts by lifting 2D image features (e.g. DINOv2 [oquab2023dinov2]) (b) to a 3D representation. We then fine-tune the 2D foundation model using the 3D-aware features (c). We demonstrate that incorporating the fine-tuned features (d) results in improved performance on downstream tasks such as semantic segmentation and depth estimation on a variety of datasets with simple linear probing (right). Feature maps are visualized using principal component analysis (PCA).
  • Figure 2: Our 3D-aware fine-tuning is universal and applicable to a variety of 2D vision models, e.g. DINOv2 [oquab2023dinov2], DINOv2-reg [darcet2023vision], CLIP [radford2021learning], MAE [he2022masked], and DeiT-III [touvron2022deit] (cf. Sec. \ref{sec:multi_models}).
  • Figure 3: Overall pipeline. We present a two-stage pipeline. In the first stage, we lift 2D foundation features (e.g. DINOv2 [oquab2023dinov2]) into 3D-aware features by training a 3D Gaussian representation $\mathcal{G}_{i}$. In the second stage, we use the rendered features to fine-tune the 2D foundation model $\varepsilon^{\mathrm{2D}}$. We denote gradient flow with $\rightarrow$.
  • Figure 4: Lifting 2D features into a 3D Gaussian representation. We equip each Gaussian with a low-dimensional feature vector $\mathbf{f}$. We render colors using the same color rasterizer as Gaussian splatting [kerbl20233d]. We design a feature rasterizer to render a low-dimensional feature image $\mathbf{F}^{\mathrm{low}}$, which is subsequently projected to a high-dimensional feature image $\mathbf{F}^{\mathrm{high}}$ using a simple CNN. We use 2D foundation features $\mathbf{F}$ from model $\varepsilon^{\mathrm{2D}}$ to supervise the feature learning.
  • Figure 5: Semantic segmentation on indoor datasets with linear probing. Incorporating our 3D-aware fine-tuned features helps obtain cleaner and more compact segmentation results, especially for detailed structures and in homogeneous regions.
  • ...and 6 more figures
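
The feature-lifting step described in the Figure 4 caption can be sketched for a single pixel as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the standard front-to-back alpha compositing used in Gaussian splatting, replaces the up-projection CNN with a hypothetical per-pixel linear map, and assumes an L1 penalty against the 2D foundation features. All function names are invented for illustration.

```python
from typing import List, Sequence


def composite_features(
    features: Sequence[Sequence[float]],
    alphas: Sequence[float],
) -> List[float]:
    """Blend per-Gaussian low-dimensional features for one pixel.

    features[i] is the feature vector f of the i-th Gaussian hit by the ray,
    sorted near to far; alphas[i] is its opacity contribution at this pixel.
    """
    dim = len(features[0])
    out = [0.0] * dim
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for f, a in zip(features, alphas):
        weight = a * transmittance
        for k in range(dim):
            out[k] += weight * f[k]
        transmittance *= 1.0 - a
    return out


def up_project(low: Sequence[float], weight: Sequence[Sequence[float]]) -> List[float]:
    """Map F_low to F_high with a linear layer (assumed stand-in for the CNN)."""
    return [sum(w * x for w, x in zip(row, low)) for row in weight]


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between rendered and 2D foundation features (assumed loss)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Two Gaussians with 2-dimensional features, projected up to 3 dimensions.
f_low = composite_features([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
f_high = up_project(f_low, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
loss = l1_loss(f_high, [0.5, 0.25, 0.75])
```

Storing a short vector per Gaussian and expanding it only after rendering keeps the 3D representation small, which is the motivation the caption gives for rendering $\mathbf{F}^{\mathrm{low}}$ before projecting to $\mathbf{F}^{\mathrm{high}}$.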