Can Shape-Infused Joint Embeddings Improve Image-Conditioned 3D Diffusion?

Cristian Sbrolli, Paolo Cudrano, Matteo Matteucci

TL;DR

This work addresses the limitation of relying on text–image embeddings for 3D shape generation by introducing CISP, a contrastive image–shape pre-training framework that aligns 2D images with 3D shapes in a shared embedding space. By conditioning a 3D diffusion model on CISP embeddings rather than CLIP embeddings, the authors demonstrate comparable generation quality and diversity but substantially improved coherence between generated shapes and the conditioning images, including for zero-shot sketch and real-world inputs. The CISP embedding space shows smoother, more physically plausible transitions under interpolation, and the approach generalizes to out-of-distribution inputs. The results motivate further development of large-scale multimodal systems that explicitly incorporate 3D representations to advance image-to-3D content synthesis.

Abstract

Recent advancements in deep generative models, particularly with the application of CLIP (Contrastive Language-Image Pretraining) to Denoising Diffusion Probabilistic Models (DDPMs), have demonstrated remarkable effectiveness in text-to-image generation. The well-structured embedding space of CLIP has also been extended to image-to-shape generation with DDPMs, yielding notable results. Despite these successes, some fundamental questions arise: Does CLIP ensure the best results in shape generation from images? Can we leverage conditioning to bring explicit 3D knowledge into the generative process and obtain better quality? This study introduces CISP (Contrastive Image-Shape Pre-training), designed to enhance 3D shape synthesis guided by 2D images. CISP aims to enrich the CLIP framework by aligning 2D images with 3D shapes in a shared embedding space, specifically capturing 3D characteristics potentially overlooked by CLIP's text-image focus. Our comprehensive analysis assesses CISP's guidance performance against CLIP-guided models, focusing on generation quality, diversity, and coherence of the produced shapes with the conditioning image. We find that, while matching CLIP in generation quality and diversity, CISP substantially improves coherence with input images, underscoring the value of incorporating 3D knowledge into generative models. These findings suggest a promising direction for advancing the synthesis of 3D visual content by integrating multimodal systems with 3D representations.
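
To make the idea of aligning images and shapes in a shared embedding space concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired image and shape embeddings. It assumes that CISP uses an objective of this family, which the abstract suggests but does not spell out. The function names and the temperature value of 0.07 are illustrative choices, not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, shape_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-shape similarity matrix.

    Row k of image_embs and row k of shape_embs are assumed to be a
    matching pair; every other combination in the batch is a negative.
    The temperature is a hypothetical default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    shps = [l2_normalize(v) for v in shape_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and shape j
    logits = [
        [sum(a * b for a, b in zip(img, shp)) / temperature for shp in shps]
        for img in imgs
    ]
    # Image-to-shape direction: each image should pick its own shape.
    loss_i2s = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Shape-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_s2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2s + loss_s2i)


# Toy batch: correctly paired embeddings versus deliberately swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each image sits next to its own shape and large when the pairs are swapped, which is the pressure that pulls matching images and shapes together in the shared space.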

Paper Structure (15 sections, 6 equations, 6 figures, 3 tables)

This paper contains 15 sections, 6 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Overview of our study. Our analysis explores the impact of employing image-shape embedding spaces versus text-image embedding spaces for generating 3D shapes from images using DDPMs. The results indicate that both models achieve satisfactory quality and diversity. However, the use of 3D-aware embeddings demonstrates enhanced alignment and consistency between the generated 3D shapes and the original conditioning images.
  • Figure 2: Our image-conditioned 3D generation pipeline. The query image is processed via a pre-trained image encoder $E_i$ to produce image embeddings. These embeddings are used, in combination with additional context provided by $E_c$, to condition a 3D DDPM. Notably, the pre-trained image encoder $E_i$ can be either CLIP or CISP.
  • Figure 3: Examples of image-guided shape generation with CISP-guided and CLIP-guided models. We also report a point cloud generated by LION, also guided with CLIP. Notice how all CLIP-guided models are biased towards the same structural mistakes (e.g., chair backrest hole, airplane tail engines).
  • Figure 4: We interpolate embeddings between Start and End Images and generate shapes with our CISP-guided DDPM (orange) and CLIP-guided DDPM (gray). The CISP-guided model displays a smoother transition in terms of structural 3D components. The chair in orange slowly grows wheels and armrests and mutates its backrest, as opposed to a sharp style change in the chair in gray. Similarly, the orange racecar slowly changes height, wheel size, and overall shape to become a monster truck, in contrast with the abrupt change seen in gray.
  • Figure 5: Generation results from hand-drawn sketches, demonstrating both models' generalization capabilities and highlighting CISP's closer attention to structural details.
  • ...and 1 more figures
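
The interpolation experiment described in the Figure 4 caption can be sketched as follows. This is a minimal illustration: it assumes linear interpolation between the two image embeddings followed by re-normalization, whereas the caption only states that embeddings are interpolated, so the exact scheme is an assumption. The function name `interpolate_embeddings` is hypothetical, and each resulting vector would then be passed to the conditioned DDPM, a step not shown here.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def interpolate_embeddings(start, end, steps):
    """Return `steps` unit-norm embeddings from `start` to `end`, inclusive.

    Assumption: linear blend plus re-normalization. Exactly opposite
    inputs would produce a zero vector at the midpoint and are not handled.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    a, b = normalize(start), normalize(end)
    path = []
    for k in range(steps):
        t = k / (steps - 1)  # t runs from 0.0 (start) to 1.0 (end)
        mixed = [(1.0 - t) * x + t * y for x, y in zip(a, b)]
        path.append(normalize(mixed))
    return path


# Toy 3-dimensional "embeddings" standing in for the Start and End images.
path = interpolate_embeddings([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], steps=5)
```

Each element of `path` plays the role of one conditioning vector along the transition. How smoothly the generated shapes change across those vectors is the property the caption compares between the CISP-guided and CLIP-guided models.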