Moyun: A Diffusion-Based Model for Style-Specific Chinese Calligraphy Generation

Kaiyuan Liu, Jiahao Mei, Hengyu Zhang, Yihuai Zhang, Daoguo Dong, Liang He

TL;DR

The paper introduces Moyun, a diffusion-based Chinese calligraphy generator that achieves controllable output by conditioning on three labels: calligrapher, font, and character. By replacing the U-Net backbone with Vision Mamba and employing a TripleLabel conditioning mechanism, Moyun enables zero-shot composition across new label combinations. The large-scale Mobao dataset, with 1.93 million images prepared by SAMSAM-based binarization, supports robust learning and evaluation. Quantitative and qualitative results show improved structural fidelity (IoU, PSNR) and competitive stylistic accuracy in human assessments. This work advances controllable, culturally faithful calligraphy generation for digital heritage and artistic design.
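
The two structural metrics named above, IoU and PSNR, can be illustrated on binarized images. The following is a minimal sketch using the standard definitions, assuming images are nested lists of 0/1 pixels with 1 marking ink. The function names and the toy 3x3 images are illustrative assumptions, not the paper's evaluation code.

```python
import math

def iou(pred, target):
    """Intersection over union of the ink pixels in two binary images."""
    inter = union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty images are treated as a perfect match (an assumed convention).
    return inter / union if union else 1.0

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    sq_err = 0.0
    n = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            sq_err += (p - t) ** 2
            n += 1
    mse = sq_err / n
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)

# Toy example: two 3x3 "strokes" that differ in two pixels.
generated = [[0, 1, 1],
             [0, 1, 0],
             [0, 1, 0]]
reference = [[0, 1, 0],
             [0, 1, 0],
             [0, 1, 1]]

score_iou = iou(generated, reference)    # 3 shared ink pixels out of 5 in the union
score_psnr = psnr(generated, reference)  # MSE is 2/9 for this pair
```

Higher values of both metrics indicate that the generated strokes overlap the reference more closely. Neither metric measures stylistic accuracy, which the paper assesses separately through human evaluation.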

Abstract

Although Chinese calligraphy generation has achieved style transfer, generating calligraphy by specifying the calligrapher, font, and character style remains challenging. To address this, we propose a new Chinese calligraphy generation model, 'Moyun', which replaces the U-Net in the diffusion model with Vision Mamba and introduces the TripleLabel control mechanism to achieve controllable calligraphy generation. The model was tested on our large-scale dataset 'Mobao' of over 1.9 million images, and the results demonstrate that 'Moyun' can effectively control the generation process and produce calligraphy in the specified style. Even for calligraphy that the calligrapher never wrote, 'Moyun' can generate output that matches the calligrapher's style.
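
One way to picture conditioning on three labels is a separate embedding table per label type, fused into a single condition vector. The sketch below is an assumption for illustration only: the class name `TripleLabelEmbedding`, the table sizes, and the element-wise sum used for fusion are hypothetical, and the paper's TripleLabel mechanism may combine the labels differently.

```python
import random

class TripleLabelEmbedding:
    """Hypothetical conditioning module: one lookup table per label type."""

    def __init__(self, n_calligraphers, n_fonts, n_characters, dim, seed=0):
        rng = random.Random(seed)

        def table(n):
            # Small random vectors stand in for learned embeddings.
            return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n)]

        self.calligrapher = table(n_calligraphers)
        self.font = table(n_fonts)
        self.character = table(n_characters)

    def __call__(self, calligrapher_id, font_id, character_id):
        # Assumed fusion rule: element-wise sum of the three embeddings.
        rows = (
            self.calligrapher[calligrapher_id],
            self.font[font_id],
            self.character[character_id],
        )
        return [sum(vals) for vals in zip(*rows)]

# Illustrative sizes only; they are not taken from the Mobao dataset.
embed = TripleLabelEmbedding(n_calligraphers=4, n_fonts=5, n_characters=100, dim=8)
cond = embed(calligrapher_id=2, font_id=1, character_id=42)
other = embed(calligrapher_id=3, font_id=1, character_id=42)
```

Because each label has its own table, any (calligrapher, font, character) triple yields a condition vector, including triples that never occur together in the training data. Whether the output for such a triple is stylistically faithful depends on the trained model, not on this lookup.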

Paper Structure

This paper contains 12 sections, 10 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: (a) shows the character "bai" (which means "white" in Chinese) written in different fonts by various calligraphers. Each column represents a different font, and each row corresponds to a different calligrapher. (b) The first row shows calligraphy generated by Calliffusion, while the second row shows the ground truth. In the first column (san, regular script, Yan Zhenqing), the strokes in the ground truth are evenly spaced, but Calliffusion's result is not. In the second column (ba, cursive script, Su Shi), Calliffusion's output appears less stable than the ground truth from an aesthetic standpoint.
  • Figure 2: "Moyun" architecture. The input latent noise is split into patches. The label is a combination of the calligrapher, font, and character. We used Mamba2-Replacement-Vision Mamba to process the patches.
  • Figure 3: (a) shows the directory structure of the "Mobao" dataset, using the calligrapher "Fan Zhongyan" as an example. "Zu" and "Zi" each have a single image, while "Bai" has multiple images. (b) demonstrates the binarization process, using the character "Zu" as an example to show the steps of selecting points, obtaining the mask, and resizing.
  • Figure 4: (a) shows the number of calligraphy samples per calligrapher (y-axis: $\ln$ of calligraphy count). (b) shows the number of calligraphy samples per character. The red and green lines mark the top $10\%$ and $50\%$ thresholds, respectively, highlighting the long-tail data imbalance.
  • Figure 5: Generated calligraphy. Each column uses different labels. The first row shows model-generated calligraphy that was previously unseen, and the second row shows the ground truth.