3DStyleGLIP: Part-Tailored Text-Guided 3D Neural Stylization

SeungJeh Chung, JooHyun Park, HyeongYeop Kang

TL;DR

3DStyleGLIP enables fine-grained, text-guided stylization of 3D objects by grounding and manipulating individual mesh parts in GLIP's embedding space. It jointly learns part localization and appearance control through SVBRDF, normals, and lighting modeled by neural fields and spherical Gaussians, guided by text prompts that couple style and part phrases. The method introduces a part-level style loss in GLIP space and an optional CLIP-based alternating objective with multi-view fine-tuning, achieving high-quality, part-specific stylizations with robust performance and stability. Experimental results on diverse meshes show superior part-tailored results and user-perceived alignment to prompts compared with baselines, underscoring practical value for customizable 3D content creation.
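As an illustration of the lighting representation mentioned above, the following minimal sketch evaluates a mixture of spherical Gaussians in a given direction, using the standard lobe definition G(v) = μ · exp(λ (v · ξ − 1)). The function names, the scalar (non-RGB) amplitude, and the example lobe values are assumptions made for illustration; this summary does not state how the paper parameterizes its lobes.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
# One lobe: (axis xi, sharpness lambda, amplitude mu). Hypothetical layout.
Lobe = Tuple[Vec3, float, float]


def normalize(v: Vec3) -> Vec3:
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return (v[0] / norm, v[1] / norm, v[2] / norm)


def spherical_gaussian(direction: Vec3, axis: Vec3,
                       sharpness: float, amplitude: float) -> float:
    """Evaluate one lobe: amplitude * exp(sharpness * (dot(v, xi) - 1))."""
    v = normalize(direction)
    xi = normalize(axis)
    cos_angle = sum(a * b for a, b in zip(v, xi))
    return amplitude * math.exp(sharpness * (cos_angle - 1.0))


def sg_lighting(direction: Vec3, lobes: Sequence[Lobe]) -> float:
    """Sum all lobes to approximate incoming light from `direction`."""
    return sum(spherical_gaussian(direction, axis, sharpness, amplitude)
               for axis, sharpness, amplitude in lobes)


# Illustrative values only: one bright lobe from above, one dim lobe from the side.
example_lobes = [((0.0, 0.0, 1.0), 5.0, 2.0), ((1.0, 0.0, 0.0), 1.0, 0.5)]
radiance_up = sg_lighting((0.0, 0.0, 1.0), example_lobes)
```

Each lobe peaks at its amplitude along its axis and falls off smoothly with angle, so a handful of lobes gives a compact, differentiable lighting model. In a trainable setting the axes, sharpness values, and amplitudes would be the optimized parameters.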

Abstract

3D stylization, the application of specific styles to three-dimensional objects, offers substantial commercial potential by enabling the creation of uniquely styled 3D objects tailored to diverse scenes. Recent advancements in artificial intelligence and text-driven manipulation methods have made the stylization process increasingly intuitive and automated. While these methods reduce human costs by minimizing reliance on manual labor and expertise, they predominantly focus on holistic stylization, neglecting the application of desired styles to individual components of a 3D object. This limitation restricts fine-grained controllability. To address this gap, we introduce 3DStyleGLIP, a novel framework specifically designed for text-driven, part-tailored 3D stylization. Given a 3D mesh and a text prompt, 3DStyleGLIP utilizes the vision-language embedding space of the Grounded Language-Image Pre-training (GLIP) model to localize individual parts of the 3D mesh and modify their appearance to match the styles specified in the text prompt. 3DStyleGLIP effectively integrates part localization and stylization guidance within GLIP's shared embedding space through an end-to-end process, enabled by a part-level style loss and two complementary learning techniques. This neural methodology meets the user's need for fine-grained style editing and delivers high-quality part-specific stylization results, opening new possibilities for customization and flexibility in 3D content creation. Our code and results are available at https://github.com/sj978/3DStyleGLIP.

Paper Structure (19 sections, 6 equations, 7 figures, 2 tables)


Figures (7)

  • Figure 1: The 3DStyleGLIP Pipeline. Beginning with the input of text prompts and a 3D mesh, the neural field and spherical Gaussian functions for SVBRDF, normal, and lighting conditions are trained to apply specific styles to individual parts of the 3D mesh.
  • Figure 2: The joint training of part localization and stylization guidance in a shared embedding space.
  • Figure 3: Visualization of multi-view fine-tuning. (a) Uniformly distributed multiple viewpoints. (b) Comparison of GLIP performance before and after fine-tuning. The floating-point numbers indicate the detection accuracy of each bounding box, ranging from 0 to 1.
  • Figure 4: Comparative visualization of stylization results from eight different methodologies. This demonstrates each method's capability to recognize and apply distinct styles to individual parts of 3D meshes. The applied text prompts T are "gold flower and silver stem" for the rose, "diamond shell, gold legs, and sapphire claws" for the crab, and "lava head and twisted-leather handle" for the hammer. To account for varying responsiveness to text prompts across methods, we tested multiple formulations and selected the optimal prompt for each method. Examples include: "a DSLR photo of a {object type} made of {T}", "an image of a {object type} made of {T}", and "{object type}, {T}".
  • Figure 5: A variety of examples of part-tailored stylization outcomes generated by 3DStyleGLIP, illustrating the framework's capability to produce high-quality results across a range of scenarios.
  • ...and 2 more figures
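
The prompt formats quoted in the Figure 4 caption can be illustrated with a small helper. This is a sketch only: the function names are hypothetical, and the summary does not say how the authors assemble their prompts in code. It joins (part, style) pairs into a prompt T and fills T into the three baseline formulations listed in the caption.

```python
from typing import List, Tuple


def join_part_styles(pairs: List[Tuple[str, str]]) -> str:
    """Join (part, style) pairs into a prompt such as 'gold flower and silver stem'."""
    phrases = [f"{style} {part}" for part, style in pairs]
    if not phrases:
        raise ValueError("at least one (part, style) pair is required")
    if len(phrases) == 1:
        return phrases[0]
    if len(phrases) == 2:
        return f"{phrases[0]} and {phrases[1]}"
    # Three or more phrases: comma-separated with a final ', and'.
    return ", ".join(phrases[:-1]) + f", and {phrases[-1]}"


def baseline_prompts(object_type: str, t: str) -> List[str]:
    """Fill the object type and prompt T into the formulations from the caption."""
    return [
        f"a DSLR photo of a {object_type} made of {t}",
        f"an image of a {object_type} made of {t}",
        f"{object_type}, {t}",
    ]


rose_t = join_part_styles([("flower", "gold"), ("stem", "silver")])
crab_t = join_part_styles([("shell", "diamond"), ("legs", "gold"), ("claws", "sapphire")])
rose_prompts = baseline_prompts("rose", rose_t)
```

Keeping parts and styles as separate fields makes the part phrase available on its own, which is the piece a grounding model would need for localization, while the joined string serves as the style prompt.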