TwoSquared: 4D Generation from 2D Image Pairs

Lu Sang, Zehranaz Canfes, Dongliang Cao, Riccardo Marin, Florian Bernard, Daniel Cremers

TL;DR

TwoSquared addresses 4D generation from only two 2D frames by splitting the task into an image-to-3D generation step for endpoints and a physically grounded velocity-field deformation to interpolate. It combines a flexible 3D generation backbone with a Vertex Registration module that establishes robust correspondences via a functional map, and a Shape Deformation module that optimizes a velocity field under physical constraints to produce a continuous 4D sequence with texture- and geometry-consistency. The method is template-free, works on in-the-wild inputs, supports arbitrary frame rates without retraining, and demonstrates superior geometry and texture quality on 4D-DRESS and web-image scenarios. This work enables practical, controllable, minimal-input 4D generation and has potential to augment Generative AI pipelines with dynamic, physically plausible content.
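
The claim that a velocity field supports arbitrary frame rates without retraining can be illustrated with a small numerical sketch. This is not the authors' implementation: `toy_velocity` is a hypothetical analytic stand-in for the optimized velocity field (a constant-speed rotation about the z-axis), and `integrate_frames` with its `substeps` parameter is an assumed helper using a simple midpoint integrator. The point is only that, once a field $v(x, t)$ is available, the same field can be sampled at any number of time steps.

```python
import numpy as np


def toy_velocity(points, t):
    """Hypothetical stand-in for a learned velocity field v(x, t).

    Rotates points about the z-axis at a constant angular speed of
    pi/2 per unit time, so t in [0, 1] gives a 90-degree turn.
    """
    omega = np.pi / 2
    vx = -omega * points[:, 1]
    vy = omega * points[:, 0]
    vz = np.zeros(len(points))
    return np.stack([vx, vy, vz], axis=1)


def integrate_frames(points0, velocity_fn, num_frames, substeps=200):
    """Advect points through velocity_fn and sample num_frames >= 2 frames.

    Frames are taken at uniformly spaced times in [0, 1]. Between two
    consecutive frames the ODE dx/dt = v(x, t) is integrated with
    `substeps` midpoint (second-order Runge-Kutta) steps.
    Returns an array of shape (num_frames, N, 3).
    """
    frame_times = np.linspace(0.0, 1.0, num_frames)
    x = np.asarray(points0, dtype=float).copy()
    frames = [x.copy()]
    t = 0.0
    for t_next in frame_times[1:]:
        dt = (t_next - t) / substeps
        for _ in range(substeps):
            k1 = velocity_fn(x, t)
            k2 = velocity_fn(x + 0.5 * dt * k1, t + 0.5 * dt)
            x = x + dt * k2
            t += dt
        frames.append(x.copy())
    return np.stack(frames)


if __name__ == "__main__":
    start = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.5]])
    coarse = integrate_frames(start, toy_velocity, num_frames=5)
    fine = integrate_frames(start, toy_velocity, num_frames=30)
    # Same field, two different frame rates, same end state.
    print(coarse.shape, fine.shape)
    print(np.abs(coarse[-1] - fine[-1]).max())
```

Changing `num_frames` only changes how densely the trajectory is sampled; the field itself is untouched, which is the property the summary describes.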

Abstract

Despite the astonishing progress in generative AI, 4D dynamic object generation remains an open challenge. With limited high-quality training data and heavy computing requirements, hallucinating unseen geometry together with unseen movement poses great challenges to generative models. In this work, we propose TwoSquared, a method to obtain a physically plausible 4D sequence starting from only two 2D RGB images corresponding to the beginning and end of the action. Instead of directly solving the 4D generation problem, TwoSquared decomposes the problem into two steps: 1) an image-to-3D generation module based on an existing generative model trained on high-quality 3D assets, and 2) a physically inspired deformation module to predict intermediate movements. As a result, our method does not require templates or object-class-specific prior knowledge and can take in-the-wild images as input. In our experiments, we demonstrate that TwoSquared is capable of producing texture-consistent and geometry-consistent 4D sequences given only 2D images.

Paper Structure

This paper contains 20 sections, 14 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: TwoSquared takes a pair of 2D images representing the initial and final states of an object as input and generates texture-consistent, geometry-consistent 4D continuous sequences. It is designed to be robust to varying input quality, operating without the need for predefined templates or object-class priors. This adaptability enables greater flexibility in processing diverse images while maintaining structural integrity and visual coherence throughout the generated sequences. As demonstrated, our approach effectively handles humans, animals, and inanimate objects.
  • Figure 2: Pipeline of TwoSquared: TwoSquared processes two input images through an image-to-3D generation block, producing two 3D meshes. We then extract per-vertex features and compute a cosine similarity map, which is refined using a functional map module and a closed-loop check module to obtain point-to-point correspondences and a confidence map. These registered points are then fed into our shape deformation module, where we model the trajectory of the deformed point cloud. At inference time, we can directly deform the textured mesh generated from $I_0$ to obtain the 4D sequence.
  • Figure 3: Comparison with other methods: TwoSquared generates texture-consistent, physically plausible 4D sequences, and it is more robust than 4Deform [sang20254deform] to correspondence noise. In contrast, other methods show artifacts in the intermediate shapes.
  • Figure 4: Visualization of ablations: We show the qualitative results of our ablation study. The visual results coincide with the errors reported in Table \ref{tab:ablation}: the stretching loss and overlapping loss help the shape remain physically plausible. While the normal loss has little qualitative impact, it does lead to quantitative improvements (see Table \ref{tab:ablation}).
  • Figure 5: 4D reconstruction from web images. Our method can take a pair of images as input and generate the temporally consistent 4D deformation between these two objects.
  • ...and 10 more figures
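
The correspondence step described in the Figure 2 caption (per-vertex features, a cosine similarity map, then a check that yields point-to-point matches and a confidence map) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's module: the functional map refinement is omitted entirely, the loop check is interpreted here as a mutual nearest-neighbour (forward-backward) test, and the function names `cosine_similarity_map` and `cycle_consistent_matches` are hypothetical.

```python
import numpy as np


def cosine_similarity_map(feat_a, feat_b, eps=1e-12):
    """Pairwise cosine similarity between two sets of per-vertex features.

    feat_a has shape (Na, D), feat_b has shape (Nb, D).
    Returns a matrix of shape (Na, Nb).
    """
    a = feat_a / (np.linalg.norm(feat_a, axis=1, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=1, keepdims=True) + eps)
    return a @ b.T


def cycle_consistent_matches(sim):
    """Assumed loop check: keep i -> j only if j maps back to i.

    Returns (a_to_b, keep, confidence), where a_to_b[i] is the best match
    in B for vertex i of A, keep[i] marks matches that survive the
    forward-backward test, and confidence[i] is the similarity of a kept
    match (0 for rejected ones).
    """
    a_to_b = sim.argmax(axis=1)   # best B vertex for every A vertex
    b_to_a = sim.argmax(axis=0)   # best A vertex for every B vertex
    idx_a = np.arange(sim.shape[0])
    keep = b_to_a[a_to_b] == idx_a
    confidence = np.where(keep, sim[idx_a, a_to_b], 0.0)
    return a_to_b, keep, confidence


if __name__ == "__main__":
    # Toy features: the last vertex of A competes with the first one
    # for the same target in B and should be rejected by the loop check.
    feat_a = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0],
                       [0.9, 0.1, 0.0]])
    feat_b = np.array([[0.0, 1.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0]])
    sim = cosine_similarity_map(feat_a, feat_b)
    matches, keep, conf = cycle_consistent_matches(sim)
    print(matches, keep, conf)
```

Rejected matches receive zero confidence here so that a downstream deformation step could down-weight them; how the actual method uses its confidence map is not specified in this summary.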