KeyPointDiffuser: Unsupervised 3D Keypoint Learning via Latent Diffusion Models
Rhys Newbury, Juyan Zhang, Tin Tran, Hanna Kurniawati, Dana Kulić
TL;DR
KeyPointDiffuser introduces a fully unsupervised 3D keypoint learning framework that encodes a point cloud into explicit 3D keypoints and an auxiliary latent, then uses a latent diffusion model conditioned on these keypoints to reconstruct and generate 3D shapes. The method combines a PointTransformerV3-based encoder, a differentiable soft keypoint projection, and a diffusion decoder trained with a curriculum noise schedule. Training is guided by Chamfer-based geometry losses, a deformation consistency term, a coverage term based on farthest point sampling (FPS), and KL regularization, which together encourage semantically meaningful and repeatable keypoints. Empirical results on ShapeNet show superior keypoint consistency (DAS and correlation) and competitive shape generation and reconstruction across categories, along with smooth keypoint interpolation and robust performance under varying keypoint counts. Limitations include reliance on point clouds rather than meshes, potential noise in sparse regions, and the absence of explicit symmetry priors. As future work, the authors suggest direct mesh diffusion and geometric priors to improve the fidelity and structural coherence of generated shapes.
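Two of the geometric ingredients named above, the Chamfer distance and farthest point sampling, are standard and can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the squared-distance Chamfer variant is an assumption, and how the paper weights or combines these terms is not specified here.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumption: squared Euclidean distances, averaged per direction.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Mean nearest-neighbour distance from a to b, plus from b to a.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()


def farthest_point_sampling(points, k, start=0):
    """Greedily select k indices, each farthest from the points chosen so far."""
    chosen = [start]
    # Squared distance from every point to its nearest chosen point.
    dist = np.sum((points - points[start]) ** 2, axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.sum((points - points[nxt]) ** 2, axis=1))
    return np.array(chosen)
```

One plausible use of the two together, again as an assumption, is to compare predicted keypoints against an FPS subsample of the input cloud with `chamfer_distance`, so that keypoints are penalized for leaving parts of the shape uncovered.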
Abstract
Understanding and representing the structure of 3D objects in an unsupervised manner remains a core challenge in computer vision and graphics. Most existing unsupervised keypoint methods are not designed for unconditional generative settings, restricting their use in modern 3D generative pipelines; our formulation explicitly bridges this gap. We present an unsupervised framework for learning spatially structured 3D keypoints from point cloud data. These keypoints serve as a compact and interpretable representation that conditions an Elucidated Diffusion Model (EDM) to reconstruct the full shape. The learned keypoints exhibit repeatable spatial structure across object instances and support smooth interpolation in keypoint space, indicating that they capture geometric variation. Our method achieves strong performance across diverse object categories, yielding a 6 percentage-point improvement in keypoint consistency compared to prior approaches.
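To make the conditioning and interpolation ideas concrete, the sketch below shows the standard EDM preconditioning of Karras et al. (2022) wrapped around a keypoint-conditioned network, together with linear interpolation between two keypoint sets. It is an illustration under stated assumptions: `raw_net`, `denoise`, and `interpolate_keypoints` are hypothetical names, `sigma_data = 0.5` is the EDM default rather than a value taken from this paper, and the paper's actual network, latent space, and interpolation scheme may differ.

```python
import numpy as np

SIGMA_DATA = 0.5  # EDM default; assumed here, not taken from the paper


def edm_coefficients(sigma, sigma_data=SIGMA_DATA):
    """Preconditioning coefficients from the EDM formulation."""
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total
    c_out = sigma * sigma_data / np.sqrt(total)
    c_in = 1.0 / np.sqrt(total)
    c_noise = np.log(sigma) / 4.0
    return c_skip, c_out, c_in, c_noise


def denoise(raw_net, x_noisy, sigma, keypoints):
    """Denoised estimate D(x; sigma, keypoints).

    raw_net is a hypothetical callable (scaled_input, noise_embedding,
    keypoints) -> array with the same shape as x_noisy.
    """
    c_skip, c_out, c_in, c_noise = edm_coefficients(sigma)
    return c_skip * x_noisy + c_out * raw_net(c_in * x_noisy, c_noise, keypoints)


def interpolate_keypoints(kp_a, kp_b, t):
    """Linear blend of two (K, 3) keypoint sets for t in [0, 1]."""
    return (1.0 - t) * kp_a + t * kp_b
```

Sampling from the diffusion model with `interpolate_keypoints(kp_a, kp_b, t)` as the condition for several values of `t` is one way to probe whether intermediate keypoint configurations decode to plausible shapes.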
