Learning 3D-Aware GANs from Unposed Images with Template Feature Field

Xinya Chen, Hanlei Guo, Yanrui Bin, Shangzhan Zhang, Yuanbo Yang, Yue Wang, Yujun Shen, Yiyi Liao

TL;DR

This paper introduces TeFF, a template feature field that enables 3D-aware GAN training from unposed images by jointly learning a 3D semantic template alongside the radiance field. Pose estimation for real images is performed on the fly via discretized camera poses and phase correlation, leveraging semantically aligned DINO features to recover full 3D geometry across challenging datasets. The approach uses a background 2D generator and dual discriminators to stabilize training, and EMA-derived template fields to facilitate pose matching. Across cars, planes, and elephants, TeFF outperforms state-of-the-art baselines in FID, depth accuracy, and pose distribution fidelity, demonstrating robust 360-degree rendering without ground-truth poses. The method advances scalable 3D-aware generative modeling for real-world, unposed image collections, with limitations including a single template per category and sensitivity to perspective distortions.

Abstract

Accurate camera poses for training images have been shown to benefit the learning of 3D-aware generative adversarial networks (GANs), yet collecting them can be quite expensive in practice. This work targets learning 3D-aware GANs from unposed images, for which we propose to perform on-the-fly pose estimation of training images with a learned template feature field (TeFF). Concretely, in addition to a generative radiance field as in previous approaches, we ask the generator to also learn a field from 2D semantic features while sharing the density from the radiance field. Such a framework allows us to acquire a canonical 3D feature template leveraging the dataset mean discovered by the generative model, and further to efficiently estimate the pose parameters on real data. Experimental results on various challenging datasets demonstrate the superiority of our approach over state-of-the-art alternatives from both the qualitative and the quantitative perspectives.

Paper Structure

This paper contains 19 sections, 5 equations, 19 figures, 9 tables.

Figures (19)

  • Figure 1: Qualitative Comparison of 3DGP (cols. 1-3), PoF3D (cols. 4-6), and Ours (cols. 7-9) when rendering in the same camera pose. Note that objects generated by 3DGP and PoF3D face different directions with the same camera pose while ours face the same direction.
  • Figure 2: Method Overview. We augment the generative radiance field with a semantic feature field, enabling estimating camera poses of real images on the fly to facilitate the 3D-aware GAN training. Specifically, we map a randomly sampled noise vector to a radiance field and a semantic feature field. By taking the mean shape of the feature field, we obtain a 3D template feature field. This allows us to perform efficient 2D-3D pose estimation to estimate camera poses of real images, which are in turn fed into the generator to perform volume rendering. The blue part is the auxiliary task of pose estimation we introduced.
  • Figure 3: Camera Pose Estimation. We leverage the template feature field to estimate camera poses of 2D real images. We discretize the azimuth $\theta$ and elevation $\phi$ angles and render the feature field from these discretized camera poses. Then we use phase correlation to estimate the scale and the in-plane rotation in the 2D image space and warp each of the rendered templates based on the solution. We calculate the mean square error between the warped rendering and the real feature and further obtain the probability distribution function of the camera pose. Finally, we sample the camera pose using inverse sampling.
  • Figure 4: Camera Model.
  • Figure 5: Qualitative Comparison on SDIP Elephant. We show each sample from 360° viewing directions.
  • ...and 14 more figures
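
The last two steps of the Figure 3 caption (turning per-pose errors into a probability distribution and drawing a pose by inverse sampling) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sample_pose_from_errors`, the softmax-style conversion from mean square error to probability, and the `temperature` parameter are assumptions made for illustration. The sketch also assumes the errors between the warped template renderings and the real image feature have already been computed on the discretized azimuth/elevation grid.

```python
import numpy as np


def sample_pose_from_errors(mse, azimuths, elevations, temperature=1.0, rng=None):
    """Sample an (azimuth, elevation) pair from a grid of matching errors.

    mse        : (A, E) array, error for each discretized pose (hypothetical input).
    azimuths   : (A,) array of candidate azimuth angles.
    elevations : (E,) array of candidate elevation angles.
    temperature: assumed softness of the error-to-probability mapping.
    """
    rng = np.random.default_rng() if rng is None else rng
    mse = np.asarray(mse, dtype=float)

    # Assumption: lower error means higher probability, via a softmax over -mse.
    logits = -mse.ravel() / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    pdf = np.exp(logits)
    pdf /= pdf.sum()

    # Inverse transform sampling: draw u ~ U[0, 1) and invert the discrete CDF.
    cdf = np.cumsum(pdf)
    u = rng.random()
    idx = int(np.searchsorted(cdf, u, side="right"))
    idx = min(idx, pdf.size - 1)  # guard against round-off at the top of the CDF

    a_idx, e_idx = np.unravel_index(idx, mse.shape)
    return azimuths[a_idx], elevations[e_idx], pdf.reshape(mse.shape)


# Usage with made-up errors on a 36 x 9 pose grid (values are illustrative only).
azimuths = np.linspace(0.0, 360.0, 36, endpoint=False)
elevations = np.linspace(-40.0, 40.0, 9)
errors = np.random.default_rng(0).random((36, 9))
theta, phi, pdf = sample_pose_from_errors(errors, azimuths, elevations, temperature=0.05)
print(theta, phi)
```

Sampling from the distribution, rather than always taking the lowest-error pose, keeps some probability mass on alternative poses when several candidates match almost equally well; a smaller `temperature` in this sketch moves the behavior closer to a hard argmin.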