KeyPointDiffuser: Unsupervised 3D Keypoint Learning via Latent Diffusion Models

Rhys Newbury, Juyan Zhang, Tin Tran, Hanna Kurniawati, Dana Kulić

TL;DR

KeyPointDiffuser introduces a fully unsupervised 3D keypoint learning framework that encodes a point cloud into explicit 3D keypoints and an auxiliary latent, then uses a latent diffusion model conditioned on these keypoints to reconstruct and generate 3D shapes. The method integrates a PointTransformerV3-based encoder, a differentiable soft keypoint projection, and a diffusion decoder with a curriculum noise schedule, guided by Chamfer-based geometry losses, deformation consistency, FPS-based coverage, and KL regularization to produce semantically meaningful and repeatable keypoints. Empirical results on ShapeNet show superior keypoint consistency (DAS and correlation) and competitive shape generation/reconstruction across categories, with demonstrable keypoint interpolation and robust performance under varying keypoint counts. Limitations include reliance on point clouds rather than meshes, potential noise in sparse regions, and the absence of explicit symmetry priors; future work suggests direct mesh diffusion and geometric priors to further improve the fidelity and structural accuracy of generated shapes.
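To make the geometry terms above concrete, here is a minimal pure-Python sketch of a symmetric Chamfer distance and greedy farthest point sampling (FPS). The summary does not give the exact loss definitions, so the squared-distance Chamfer form and the `coverage_loss` combination (Chamfer distance between keypoints and FPS anchors) are assumptions for illustration, and all function names are hypothetical rather than taken from the authors' code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a and b.

    Assumed form: mean squared nearest-neighbour distance, summed over
    both directions (a -> b and b -> a).
    """
    a_to_b = sum(min(sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    chosen = [start]
    # min_d[i] is the squared distance from points[i] to its nearest chosen point.
    min_d = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_d[i])
        chosen.append(nxt)
        min_d = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return [points[i] for i in chosen]


def coverage_loss(keypoints, points):
    """Hypothetical coverage term: Chamfer distance from keypoints to FPS anchors."""
    anchors = farthest_point_sampling(points, len(keypoints))
    return chamfer_distance(keypoints, anchors)


# Toy usage: four corners of a unit square in the z = 0 plane.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
keypoints = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
loss = coverage_loss(keypoints, cloud)
```

In a training setting these terms would be computed on batched tensors with a differentiable framework; the list-based version here only illustrates the arithmetic.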

Abstract

Understanding and representing the structure of 3D objects in an unsupervised manner remains a core challenge in computer vision and graphics. Most existing unsupervised keypoint methods are not designed for unconditional generative settings, restricting their use in modern 3D generative pipelines; our formulation explicitly bridges this gap. We present an unsupervised framework for learning spatially structured 3D keypoints from point cloud data. These keypoints serve as a compact and interpretable representation that conditions an Elucidated Diffusion Model (EDM) to reconstruct the full shape. The learned keypoints exhibit repeatable spatial structure across object instances and support smooth interpolation in keypoint space, indicating that they capture geometric variation. Our method achieves strong performance across diverse object categories, yielding a 6 percentage-point improvement in keypoint consistency compared to prior approaches.

Paper Structure

This paper contains 55 sections, 46 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Top: Overview of the keypoint-conditioned 3D shape generation pipeline. The input point cloud $S_0$ is encoded into a structured latent code $z_0 = z_{\text{kp}} \oplus z_{\text{aux}}$, where $z_{\text{kp}}$ denotes the set of learned 3D keypoints and $z_{\text{aux}}$ represents auxiliary latent features sampled from a Gaussian distribution. These keypoints guide a denoising diffusion model to iteratively reconstruct the original shape. The keypoints are regularized using a Chamfer loss $\mathcal{L}_{\text{chamfer}}$ and a deformation consistency loss $\mathcal{L}_{\text{mse}}$, where $\mathcal{T}$ denotes a differentiable geometric transformation applied to the input shape to simulate structured deformations (e.g., stretching, bending, twisting, tapering). Bottom: The reverse diffusion process refines the noisy input to produce a plausible shape consistent with the keypoints. Because the noise is sampled from a standard range ($[-1, 1]$) while diffused shapes can occupy smaller spatial extents, the process appears to "zoom in" as noise is removed, causing keypoints (blue circles) to emerge and grow more prominent over time. Timesteps are sampled on a logarithmic scale.
  • Figure 2: Visualization of the model’s keypoints (colored diamonds) and their corresponding high-attention regions (colored points) overlaid on the input point cloud (gray). Each color represents a distinct keypoint and its associated attention distribution.
  • Figure 3: We visualize the keypoints identified by our method across different instances of the airplane (top), guitar (middle), and chair (bottom) classes, where the same keypoint ID across different instances is visualized with the same color. Keypoints predicted by our method are structurally consistent and repeatable across diverse geometries, demonstrating robustness to shape variation.
  • Figure 4: Linear interpolation in the learned keypoint space between two airplane shapes. The top-left and bottom-right point clouds are reconstructions of samples from the test set, while the four intermediate shapes are generated by decoding linearly interpolated keypoints. The smooth transitions demonstrate the continuity and semantic structure of the learned representation.
  • Figure 5: The result of using NKSR [huang2023nksr], a deep-learning-based algorithm that predicts a mesh from a noisy point cloud, such as the one produced by our diffusion model. The resulting mesh is noisy and unrealistic.
  • ...and 1 more figures
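
The captions for Figures 1 and 4 describe two operations on keypoints: a deformation consistency loss under a transformation $\mathcal{T}$, and linear interpolation between two keypoint sets. The sketch below illustrates both in plain Python. The captions do not state the exact form of $\mathcal{L}_{\text{mse}}$, so the version here (mean squared error between "deform the keypoints" and "keypoints of the deformed shape") is an assumption, and `toy_encoder` (a single centroid keypoint) and `stretch` are hypothetical stand-ins for the learned encoder and the structured deformations.

```python
def lerp_keypoints(kp_a, kp_b, t):
    """Linearly interpolate two keypoint sets with matching keypoint IDs."""
    return [
        tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
        for p, q in zip(kp_a, kp_b)
    ]


def stretch(points, factor):
    """Toy deformation T: scale every point along the x-axis."""
    return [(factor * x, y, z) for x, y, z in points]


def toy_encoder(points):
    """Hypothetical stand-in for the keypoint encoder: one keypoint at the centroid."""
    n = len(points)
    return [tuple(sum(p[i] for p in points) / n for i in range(3))]


def deformation_consistency(encode, points, deform):
    """Assumed L_mse: mean squared error between T(kp(S)) and kp(T(S))."""
    kp_then_deform = deform(encode(points))
    deform_then_kp = encode(deform(points))
    total = sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(kp_then_deform, deform_then_kp)
    )
    return total / len(kp_then_deform)


# Toy usage: the centroid commutes with axis scaling, so the loss is ~0.
cloud = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
loss = deformation_consistency(toy_encoder, cloud, lambda pts: stretch(pts, 2.0))

# Four evenly spaced intermediate keypoint sets, matching the count in Figure 4.
kp_start = [(0.0, 0.0, 0.0)]
kp_end = [(2.0, 4.0, 6.0)]
intermediates = [lerp_keypoints(kp_start, kp_end, i / 5) for i in range(1, 5)]
```

Each interpolated keypoint set would then be passed to the keypoint-conditioned diffusion decoder to produce a shape; that decoding step is not shown here.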