
CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization

Anh-Duy Le, Van-Linh Pham, Thanh-Nam Vo, Xuan Toan Mai, Tuan-Anh Tran

TL;DR

Extensive experiments and analysis on benchmark datasets in multiple languages, including English, Chinese, and the proposed ViHTGen dataset for Vietnamese, show that the CONSTANT method adapts to new reference styles and produces highly detailed images better than state-of-the-art approaches.

Abstract

One-shot styled handwriting image generation, despite achieving impressive results in recent years, remains challenging because it is difficult to capture the intricate and diverse characteristics of human handwriting from a single reference image. Existing methods still struggle to generate visually appealing, realistic handwritten images and to adapt to complex, unseen writer styles, in particular to isolate invariant style features (e.g., slant, stroke width, curvature) while ignoring irrelevant noise. To tackle this problem, we introduce Patch Contrastive Enhancement and Style-Aware Quantization via Denoising Diffusion (CONSTANT), a novel diffusion-based method for one-shot handwriting generation. CONSTANT leverages three key innovations: 1) a Style-Aware Quantization (SAQ) module that models style as discrete visual tokens capturing distinct concepts; 2) a contrastive objective that ensures these tokens are well separated and meaningful in the style embedding space; 3) a latent patch-based contrastive objective ($L_{LatentPCE}$) that helps improve quality and local structure by aligning multiscale spatial patches of generated and real features in latent space. Extensive experiments and analysis on benchmark datasets in multiple languages, including English, Chinese, and our proposed ViHTGen dataset for Vietnamese, show that our method adapts to new reference styles and produces highly detailed images better than state-of-the-art approaches. Code is available on GitHub.
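
To make the idea of modeling style as discrete visual tokens concrete, the following is a minimal sketch of a generic vector-quantization lookup, not the authors' SAQ module. The function name `quantize`, the toy two-dimensional codebook, and the use of squared Euclidean distance are all assumptions chosen for illustration; the abstract does not specify how SAQ selects tokens.

```python
def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    Hypothetical helper: nearest-neighbour lookup under squared
    Euclidean distance, as in generic vector quantization.
    """
    indices = []
    for feature in features:
        # Squared distance from this feature to every codebook vector.
        dists = [
            sum((a - b) ** 2 for a, b in zip(feature, code))
            for code in codebook
        ]
        indices.append(dists.index(min(dists)))
    # The discrete "style tokens" are the indices; the quantized
    # features are the codebook vectors those indices point to.
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Toy codebook with three entries (assumed values, for illustration only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
features = [[0.9, 1.2], [0.1, -0.2], [-0.8, 1.7]]

indices, quantized = quantize(features, codebook)
print(indices)  # [1, 0, 2]
```

In this toy setting each continuous style feature is replaced by one of a small set of learned vectors, which is the sense in which a reference style can be described by discrete tokens.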

Paper Structure (46 sections, 8 equations, 20 figures, 11 tables)


Figures (20)

  • Figure 1: Our method leverages the idea of vector quantization to learn the embedding space of diverse style concepts, enabling better adaptation to new styles than other approaches.
  • Figure 1: User preference study results
  • Figure 2: Overall architecture of our method. a) The training pipeline: the model is optimized with the objectives $L_{denoising}$, $L_{SCE}$, $L_{LatentPCE}$, and $L_{SAQ}$. b) Architecture of our SAQ module, with an Inception-V3 backbone, a codebook embedding, and an Attention Pool module. c) The sampling pipeline of our method.
  • Figure 2: User plausibility study results
  • Figure 3: Visualization of our $L_{LatentPCE}$ objective in its bidirectional form, which comprises two contrastive sub-losses. The first loss receives (anchor, positive, negatives); the second receives (anchor, positive, negative).
  • ...and 15 more figures
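
The Figure 3 caption describes a bidirectional contrastive objective built from two sub-losses over (anchor, positive, negatives) tuples. The sketch below shows one common way such an objective can be written, using an InfoNCE-style loss applied in both directions between generated and real patch features. It is an illustration under stated assumptions, not the paper's definition of $L_{LatentPCE}$: the cosine similarity, the temperature value, the choice of negatives (all other patches in the opposite set), and the names `info_nce` and `bidirectional_patch_loss` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one (anchor, positive, negatives) tuple."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def bidirectional_patch_loss(gen_patches, real_patches, temperature=0.1):
    """Average the loss over both directions.

    Assumption: patch i of the generated features corresponds to
    patch i of the real features; all other patches are negatives.
    """
    n = len(gen_patches)
    total = 0.0
    for i in range(n):
        real_negs = [real_patches[j] for j in range(n) if j != i]
        gen_negs = [gen_patches[j] for j in range(n) if j != i]
        # Direction 1: generated patch as anchor, real patch as positive.
        total += info_nce(gen_patches[i], real_patches[i], real_negs, temperature)
        # Direction 2: real patch as anchor, generated patch as positive.
        total += info_nce(real_patches[i], gen_patches[i], gen_negs, temperature)
    return total / (2 * n)


# Toy patch features (assumed values): aligned pairs vs. shuffled pairs.
gen = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_loss = bidirectional_patch_loss(gen, gen)
shuffled_loss = bidirectional_patch_loss(gen, [gen[1], gen[2], gen[0]])
print(aligned_loss < shuffled_loss)  # True
```

The loss is small when each generated patch is most similar to the real patch at the same location and large when the correspondence is broken, which is the behaviour a patch-alignment objective is meant to reward.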