DNF: Unconditional 4D Generation with Dictionary-based Neural Fields

Xinyi Zhang, Naiqi Li, Angela Dai

TL;DR

DNF addresses unconditional 4D generation of deforming shapes by introducing a dictionary-based neural field representation that decouples shape and motion via a shared dictionary derived from SVD of MLPs. A transformer-based diffusion model operates in the weight space of the dictionary-encoded 4D fields, with separate shape and motion streams and a sliding-window strategy to manage long sequences. Key contributions include dictionary-based fine-tuning with a compressed dictionary and residual extensions, plus per-shape coefficient vectors that preserve contiguity while enabling high fidelity. Experiments on DeformingThings4D demonstrate state-of-the-art generation quality and generalization to unseen identities, offering a compact, scalable approach for high-dimensional dynamic 3D data.

Abstract

While remarkable success has been achieved through diffusion-based 3D generative models for shapes, 4D generative modeling remains challenging due to the complexity of object deformations over time. We propose DNF, a new 4D representation for unconditional generative modeling that efficiently models deformable shapes with disentangled shape and motion while capturing high-fidelity details in the deforming objects. To achieve this, we propose a dictionary learning approach to disentangle 4D motion from shape as neural fields. Both shape and motion are represented as learned latent spaces, where each deformable shape is represented by its shape and motion global latent codes, shape-specific coefficient vectors, and shared dictionary information. This captures both shape-specific detail and global shared information in the learned dictionary. Our dictionary-based representation balances fidelity, contiguity, and compression well; combined with a transformer-based diffusion model, our method is able to generate effective, high-fidelity 4D animations.
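
To make the dictionary idea concrete, here is a minimal NumPy sketch for a single linear layer. It assumes, following the TL;DR, that the shared dictionary consists of the singular vectors from an SVD of a pre-trained weight matrix, and that each instance stores only a vector of singular-value coefficients. The function names, the matrix sizes, and the closed-form projection used to fit the coefficients are illustrative assumptions, not the authors' implementation: the paper fine-tunes the coefficients per instance, and its residual dictionary extensions are not shown here.

```python
import numpy as np


def build_dictionary(weight, rank):
    """Split a pre-trained weight into a shared dictionary (U, Vt) and base singular values.

    Keeping only the top `rank` components gives a compressed dictionary.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]


def layer_from_coefficients(u, sigma, vt):
    """Rebuild an instance-specific weight W_i = U diag(sigma_i) V^T."""
    return (u * sigma) @ vt  # scales column k of U by sigma[k]


def project_coefficients(u, vt, target_weight):
    """Least-squares coefficients for a target weight under a fixed dictionary.

    Assumption: this closed form (sigma_k = u_k^T W v_k) stands in for the
    gradient-based fine-tuning described in the paper.
    """
    return np.einsum("ik,ij,kj->k", u, target_weight, vt)


rng = np.random.default_rng(0)
shared_weight = rng.standard_normal((8, 6))  # hypothetical pre-trained layer

# Full-rank dictionary: the base coefficients reproduce the layer exactly.
u, base_sigma, vt = build_dictionary(shared_weight, rank=6)

# A hypothetical instance that differs from the base only in its coefficients.
instance_sigma = base_sigma * rng.uniform(0.5, 1.5, size=base_sigma.shape)
instance_weight = layer_from_coefficients(u, instance_sigma, vt)
recovered_sigma = project_coefficients(u, vt, instance_weight)

print("coefficients recovered:", np.allclose(recovered_sigma, instance_sigma))
```

In this sketch the per-instance storage is one coefficient per dictionary atom (6 numbers here) instead of a full 8 x 6 weight matrix, which is the compression trade-off the abstract refers to. Calling `build_dictionary` with a smaller `rank` shrinks the dictionary further at the cost of reconstruction error.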

Paper Structure

This paper contains 35 sections, 13 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: We propose DNF, a dictionary-based representation for the unconditional generation of 4D deforming shapes, with a transformer-based diffusion model. Our method is capable of generating motions with superior shape quality and temporal consistency.
  • Figure 2: Overview of learning our 4D dynamic DNF representation. We first pre-train disentangled shape and motion MLPs with per-instance latents. We then decompose the pre-trained MLPs using SVD and conduct dictionary-based fine-tuning of the singular values for each training instance, in order to capture local object detail more expressively. This yields, for each training instance, its latent shape and motion codes and its coefficient vectors, along with a globally shared dictionary. This effectively balances quality, contiguity, and compression in the learned representation space.
  • Figure 3: Training and generation of our DNFs for unconditional 4D synthesis. We employ transformer-based diffusion models to model the $\boldsymbol{\sigma}$ that modulate the shape and motion MLPs, along with shape and motion codes. At inference time, new samples can then be decoded to shape and motion to form a 4D deforming sequence.
  • Figure 4: Qualitative comparison with state of the art. Our dictionary-based approach enables generating 4D sequences with higher shape fidelity and temporal consistency.
  • Figure 5: Distribution of the average Chamfer distance from each generation of our method to its nearest neighbor in the training set, showing that our method is able to synthesize new motions.
  • ...and 1 more figures
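
The nearest-neighbor analysis described in the Figure 5 caption can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the paper's evaluation code: Chamfer distance is taken here as the sum of the two mean nearest-point distances (unsquared), sequences are compared frame by frame at equal length, and all function names are hypothetical.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets (lists of xyz tuples)."""

    def one_sided(src, dst):
        # Mean distance from each source point to its closest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)


def sequence_distance(seq_a, seq_b):
    """Average per-frame Chamfer distance between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same number of frames")
    return sum(chamfer_distance(a, b) for a, b in zip(seq_a, seq_b)) / len(seq_a)


def nearest_training_distance(generated, training_set):
    """Distance from a generated sequence to its closest training sequence."""
    return min(sequence_distance(generated, train) for train in training_set)


# Toy data: two-point "shapes" and two-frame "sequences".
segment = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
lifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
training_set = [[segment, segment], [lifted, lifted]]

novel = [segment, lifted]      # moves between the two training poses
copied = [segment, segment]    # identical to a training sequence

print("novel sample distance:", nearest_training_distance(novel, training_set))
print("copied sample distance:", nearest_training_distance(copied, training_set))
```

Collecting `nearest_training_distance` over all generated samples gives the kind of distribution the caption describes: values near zero would indicate memorized training sequences, while larger values indicate new motions.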