EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting

Suzhen Wang, Weijie Chen, Wei Zhang, Minda Zhao, Lincheng Li, Rongsheng Zhang, Zhipeng Hu, Xin Yu

TL;DR

EasyCraft tackles cross-engine avatar crafting by unifying image representations through a self-supervised ViT encoder and learning an engine-specific translator that outputs crafting parameters. It enables both photo- and text-based avatar creation by pairing this translator with a text-to-image path built on a Stable Diffusion model trained to mimic engine style. The key contributions are a universal feature extractor, an engine-specific translator trained solely on engine data, and an engine-aligned text-to-face image model, which together achieve state-of-the-art results on two RPG engines. The framework improves generalizability across avatar engines and input styles, enabling real-time, versatile avatar creation in games.

Abstract

Character customization, or 'face crafting,' is a vital feature in role-playing games (RPGs), enhancing player engagement by enabling the creation of personalized avatars. Existing automated methods often struggle with generalizability across diverse game engines due to their reliance on the intermediate constraints of a specific image domain, and they typically support only one type of input, either text or image. To overcome these challenges, we introduce EasyCraft, an innovative end-to-end feedforward framework that automates character crafting by uniquely supporting both text and image inputs. Our approach employs a translator capable of converting facial images of any style into crafting parameters. We first establish a unified feature distribution in the translator's image encoder through self-supervised learning on a large-scale dataset, enabling photos of any style to be embedded into a unified feature representation. Subsequently, we map this unified feature distribution to crafting parameters specific to a game engine, a process that can be easily adapted to most game engines and thus enhances EasyCraft's generalizability. By integrating text-to-image techniques with our translator, EasyCraft also facilitates precise, text-based character crafting. EasyCraft's ability to integrate diverse inputs significantly enhances the versatility and accuracy of avatar creation. Extensive experiments on two RPG games demonstrate the effectiveness of our method, achieving state-of-the-art results and facilitating adaptability across various avatar engines.

Paper Structure

This paper contains 18 sections, 1 equation, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Illustration of EasyCraft. The proposed method can achieve both (a) photo-based avatar auto-creation using any style of photo input, and (b) text-based avatar auto-creation from text descriptions.
  • Figure 2: Illustration of EasyCraft. (a) We first employ self-supervised learning to develop a universal vision transformer (ViT) encoder using a large-scale dataset containing various styles of photos. (b) We then train an engine-specific translator $\mathcal{T}$ that can convert input images into specific avatar crafting parameters. Our translator consists of a ViT encoder $\mathcal{T}_e$ and a parameter generation module $\mathcal{T}_p$. During this training process, only $\mathcal{T}_p$ is trained, while $\mathcal{T}_e$ is initialized from the pretrained ViT encoder and remains frozen. (c) Once the translator is obtained, we can directly perform photo-based automatic avatar creation. (d) By integrating our SD model, which can generate engine-style photos based on text, our method also facilitates text-based automatic avatar creation.
  • Figure 3: Qualitative comparisons with photo-based methods. For each input image, we show results of three methods on two engines.
  • Figure 4: Qualitative comparisons with text-based methods on two avatar customization engines. For each method, we present the results of executing the same text input three times.
  • Figure 5: Qualitative evaluations without pretraining of the ViT encoder. The bottom row depicts the generated avatars.
  • ...and 1 more figure
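
The Figure 2(b) caption says that only the parameter generation module $\mathcal{T}_p$ is trained, while the encoder $\mathcal{T}_e$ stays frozen. The following is a minimal Python sketch of that arrangement, not the authors' implementation. Everything in it is an assumption made for illustration: plain linear maps stand in for the ViT encoder and the parameter module, `toy_engine_render` is a hypothetical stand-in for a game engine, the dimensions are arbitrary, and the normalized least-mean-squares update is a swapped-in training rule rather than the paper's loss or optimizer.

```python
import copy
import random

random.seed(0)

IMAGE_DIM = 16    # hypothetical flattened-image size
FEATURE_DIM = 8   # hypothetical encoder feature size
PARAM_DIM = 4     # hypothetical number of crafting parameters


def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class FrozenEncoder:
    """Stand-in for the pretrained encoder T_e; its weights are never updated."""

    def __init__(self):
        self.weights = random_matrix(FEATURE_DIM, IMAGE_DIM)

    def __call__(self, image):
        return matvec(self.weights, image)


class ParameterHead:
    """Stand-in for the parameter generation module T_p; the only trained part."""

    def __init__(self):
        self.weights = [[0.0] * FEATURE_DIM for _ in range(PARAM_DIM)]

    def __call__(self, features):
        return matvec(self.weights, features)

    def update(self, features, target, step=0.5):
        # Normalized least-mean-squares step on the squared error (assumed rule).
        errors = [p - t for p, t in zip(self(features), target)]
        norm = sum(f * f for f in features) + 1e-8
        for row, err in zip(self.weights, errors):
            for j, f in enumerate(features):
                row[j] -= step * err * f / norm


RENDER_MATRIX = random_matrix(IMAGE_DIM, PARAM_DIM)


def toy_engine_render(params):
    """Hypothetical engine: maps crafting parameters to a flattened 'image'."""
    return matvec(RENDER_MATRIX, params)


def sample_params():
    return [random.uniform(-1.0, 1.0) for _ in range(PARAM_DIM)]


def mean_squared_error(encoder, head, dataset):
    total = 0.0
    for params in dataset:
        pred = head(encoder(toy_engine_render(params)))
        total += sum((p - t) ** 2 for p, t in zip(pred, params)) / PARAM_DIM
    return total / len(dataset)


encoder = FrozenEncoder()
head = ParameterHead()
frozen_snapshot = copy.deepcopy(encoder.weights)
held_out = [sample_params() for _ in range(50)]

initial_loss = mean_squared_error(encoder, head, held_out)
for _ in range(3000):
    params = sample_params()                       # engine-side ground truth
    features = encoder(toy_engine_render(params))  # frozen forward pass
    head.update(features, params)                  # only the head changes
final_loss = mean_squared_error(encoder, head, held_out)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

The training pairs here come entirely from the toy engine (sample parameters, render, encode), which mirrors the idea of supervising the translator with engine data only. Adapting to a different engine would, under this sketch's assumptions, mean swapping `toy_engine_render` and retraining `ParameterHead` while reusing the same encoder.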