CalliMaster: Mastering Page-level Chinese Calligraphy via Layout-guided Spatial Planning

Tianshuo Xu, Tiantian Hong, Zhifei Chen, Fei Chao, Ying-cong Chen

Abstract

Page-level calligraphy synthesis requires balancing glyph precision with layout composition. Existing character models lack spatial context, while page-level methods often compromise brushwork detail. In this paper, we present CalliMaster, a unified framework for controllable generation and editing that resolves this conflict by decoupling spatial planning from content synthesis. Inspired by the human cognitive process of "planning before writing", we introduce a coarse-to-fine pipeline (Text $\rightarrow$ Layout $\rightarrow$ Image) to tackle the combinatorial complexity of page-scale synthesis. Operating within a single Multimodal Diffusion Transformer, a spatial planning stage first predicts character bounding boxes to establish the global spatial arrangement. This intermediate layout then serves as a geometric prompt for the content synthesis stage, where the same network utilizes flow-matching to render high-fidelity brushwork. Beyond achieving state-of-the-art generation quality, this disentanglement supports versatile downstream capabilities. By treating the layout as a modifiable constraint, CalliMaster enables controllable semantic re-planning: users can resize or reposition characters while the model automatically harmonizes the surrounding void space and brush momentum. Furthermore, we demonstrate the framework's extensibility to artifact restoration and forensic analysis, providing a comprehensive tool for digital cultural heritage.
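
The two-stage pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `euler_flow`, `toy_velocity`, and `generate` are hypothetical, the "network" is replaced by a closed-form velocity field that points straight at a known target, and the "image" is reduced to one number per character. The only thing the sketch aims to show is the control flow: boxes are sampled first by integrating a flow from noise, and the resulting layout then conditions a second flow that produces the content.

```python
import random

def euler_flow(x0, velocity, steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t never reaches 1, so (1 - t) below stays positive
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(target):
    """Hypothetical stand-in for the network: a straight-line flow toward `target`."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity

def generate(text, rng):
    n = len(text)
    # Stage 1 (spatial planning): one box (cx, cy, w, h) per character.
    # Assumption: the toy target is a single evenly spaced vertical column.
    box_target = [v for i in range(n)
                  for v in (0.5, (i + 0.5) / n, 0.8 / n, 0.8 / n)]
    noise = [rng.gauss(0.0, 1.0) for _ in box_target]
    flat = euler_flow(noise, toy_velocity(box_target))
    boxes = [tuple(flat[4 * i:4 * i + 4]) for i in range(n)]

    # Stage 2 (content synthesis): the layout acts as the prompt.
    # Assumption: the toy "image" is one value per character, set by box area.
    img_target = [w * h for (_, _, w, h) in boxes]
    noise = [rng.gauss(0.0, 1.0) for _ in img_target]
    image = euler_flow(noise, toy_velocity(img_target))
    return boxes, image

boxes, image = generate("永和九年", random.Random(0))
```

Because the second stage reads only `boxes`, editing the layout between the two stages changes what is rendered without touching the first stage, which is the property the abstract relies on for re-planning.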

Paper Structure (39 sections, 15 equations, 19 figures, 5 tables)


Figures (19)

  • Figure 1: Page-Level Generation and Editing. (Left) CalliMaster synthesizes high-fidelity Chinese calligraphy that harmonizes local glyph precision with global spatial rhythm. (Right) Interactive semantic re-planning. By adjusting bounding boxes ($\square$), users can modify the layout while the model regenerates continuous inter-character strokes ($\bigcirc$) and optimizes the surrounding void space to preserve artistic momentum.
  • Figure 2: Overview of CalliMaster. The framework decouples calligraphy generation into two core stages: spatial planning and content writing. Inference executes these sequentially: planning the layout before writing the content. Training randomly samples between these two primary stages and two auxiliary states to jointly optimize the model. To enable this unified strategy, the core MF-DiT block employs modality-aware AdaLN driven by independent timesteps ($t_c$, $t_b$, $t_i$), structural attention masking, and modulate embeddings.
  • Figure 3: Qualitative comparison of CalliMaster against state-of-the-art models. A red cross $\boldsymbol{\times}$ marks incorrectly rendered or hallucinated characters, and a red circle $\bigcirc$ indicates omitted characters. Yellow boxes $\square$ highlight continuous strokes.
  • Figure 4: Semantic Layout Editing via Geometric Prompts. CalliMaster supports diverse interactive operations (replacement, scaling, deletion, and insertion). By manipulating bounding boxes, the model re-harmonizes continuous strokes and adjusts the spatial composition to maintain visual rhythm.
  • Figure 5: Per-timestep DRS at each noise scale. Low $t$ probes stroke-level fidelity; high $t$ probes global layout plausibility.
  • ...and 14 more figures
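
The bounding-box operations named in the Figure 4 caption (replacement, scaling, deletion, insertion) can be sketched as plain geometry on a list of boxes. Everything here is an assumption made for illustration: the `Box` fields, the normalized center/size parameterization, and the function names are hypothetical, and the sketch covers only the user-side edit, not the model's re-harmonization of strokes and void space.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Box:
    char: str   # character assigned to this box
    cx: float   # center x, assumed normalized to [0, 1]
    cy: float   # center y, assumed normalized to [0, 1]
    w: float    # width
    h: float    # height

def scale(layout, index, factor):
    """Resize one box about its center; the input layout is left unchanged."""
    b = layout[index]
    resized = replace(b, w=b.w * factor, h=b.h * factor)
    return layout[:index] + [resized] + layout[index + 1:]

def move(layout, index, dx, dy):
    """Shift one box by (dx, dy)."""
    b = layout[index]
    return layout[:index] + [replace(b, cx=b.cx + dx, cy=b.cy + dy)] + layout[index + 1:]

def replace_char(layout, index, char):
    """Keep the box geometry but change the character written in it."""
    return layout[:index] + [replace(layout[index], char=char)] + layout[index + 1:]

def delete(layout, index):
    """Remove one box from the layout."""
    return layout[:index] + layout[index + 1:]

def insert(layout, index, box):
    """Insert a new box before position `index`."""
    return layout[:index] + [box] + layout[index:]

layout = [Box("永", 0.5, 0.2, 0.25, 0.25), Box("和", 0.5, 0.5, 0.25, 0.25)]
edited = insert(scale(layout, 0, 2.0), 1, Box("九", 0.5, 0.35, 0.25, 0.25))
```

Each function returns a new list, so the original layout remains available for comparison; in a system like the one described, the edited layout would then be passed back to the content stage as the geometric prompt.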