CharPoet: A Chinese Classical Poetry Generation System Based on Token-free LLM

Chengyue Yu, Lei Zang, Jiaotuan Wang, Chenyi Zhuang, Jinjie Gu

TL;DR

Pruned from existing token-based LLMs, CharPoet inherits their pretrained capabilities and can generate poetry following instructions like "Write me a poem for my mother's birthday."

Abstract

Automatic Chinese classical poetry generation has attracted much research interest, but achieving effective control over format and content simultaneously remains challenging. Traditional systems usually accept keywords as user inputs, resulting in limited control over content. Large language models (LLMs) improve content control by allowing unrestricted user instructions, but the token-by-token generation process frequently makes format errors. Motivated by this, we propose CharPoet, a Chinese classical poetry generation system based on a token-free LLM, which provides effective control over both format and content. Our token-free architecture generates in a character-by-character manner, enabling precise control over the number of characters. Pruned from existing token-based LLMs, CharPoet inherits their pretrained capabilities and can generate poetry following instructions like "Write me a poem for my mother's birthday." CharPoet achieves format accuracy above 0.96, outperforming Jiuge-GPT-2 (0.91) and GPT-4 (0.38). In terms of content quality, CharPoet surpasses traditional systems including Jiuge, and is comparable to other LLMs. Our system is open source and available at https://modelscope.cn/models/CharPoet/CharPoet. A video demonstration of CharPoet is available at https://youtu.be/voZ25qEp3Dc.
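
The abstract reports format accuracy without defining it here. As a rough illustration only, the sketch below assumes that a poem is "format-correct" when the character count of each line matches a fixed pattern, and that format accuracy is the fraction of poems passing that check. The `FORMATS` table, the function names, and the punctuation-based line splitting are all hypothetical and are not taken from the paper.

```python
import re

# Assumed format table (hypothetical): a five-character quatrain
# has four lines of five characters each.
FORMATS = {"wuyan_jueju": [5, 5, 5, 5]}

def line_lengths(poem):
    """Split on Chinese punctuation or whitespace and count characters per line."""
    segments = [seg for seg in re.split(r"[，。！？；、\s]+", poem) if seg]
    return [len(seg) for seg in segments]

def matches_format(poem, form):
    """True when every line has exactly the required number of characters."""
    return line_lengths(poem) == FORMATS[form]

def format_accuracy(poems, form):
    """Fraction of poems whose line lengths match the required pattern."""
    return sum(matches_format(p, form) for p in poems) / len(poems)

# A well-formed quatrain, and a copy with one extra character in the last line.
good = "床前明月光，疑是地上霜。举头望明月，低头思故乡。"
bad = "床前明月光，疑是地上霜。举头望明月，低头思念故乡。"

print(format_accuracy([good, bad], "wuyan_jueju"))
```

A check like this only covers character counts. Other format rules of classical poetry are outside what this sketch measures.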

Paper Structure

This paper contains 21 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Poem generated by GPT-4. The poem violates the format requirement of Rumengling with 6 excess characters.
  • Figure 2: Generation process of a token-based model vs. a token-free model: (a) In a token-based model, the system may output more than one character at a time, resulting in difficulty in exerting precise control over the number of characters. (b) In a token-free model, the system outputs at most one character at a time, making control over the number of characters easier.
  • Figure 3: Pruning a token-based model into a token-free model. (a) For input, long tokens are removed from the vocabulary, so text is only tokenized into character-level or byte-level tokens and the embeddings of long tokens are never accessed. (b) The Transformer structure is left unchanged. (c) For output, the logits of long tokens are set to a large negative number, so their probabilities are zero and the language model head never produces long tokens.
  • Figure 4: The user interface and generated poetry sample of CharPoet.
  • Figure 5: Evaluation on Content Quality under the Keyword Setting.
  • ...and 2 more figures
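
The output-side step in the Figure 3 caption (setting the logits of long tokens to a large negative number) can be sketched in a few lines. This is a minimal illustration with a toy vocabulary, not the authors' implementation. The rule "a long token is any token with more than one character", the constant `NEG_INF`, and all function names are assumptions made for this example, and byte-level tokens are ignored for simplicity.

```python
import math

NEG_INF = -1e9  # assumed stand-in for the "large negative number" in the caption

def is_long_token(token):
    """Hypothetical rule: any token longer than one character counts as long."""
    return len(token) > 1

def mask_long_tokens(logits, vocab):
    """Replace the logit of every long token so it can never be sampled."""
    return [NEG_INF if is_long_token(tok) else logit
            for tok, logit in zip(vocab, logits)]

def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary mixing single-character and multi-character tokens.
vocab = ["春", "风", "春风", "明月", "月"]
logits = [1.2, 0.3, 2.5, 2.0, 0.7]

probs = softmax(mask_long_tokens(logits, vocab))
for tok, p in zip(vocab, probs):
    print(tok, round(p, 4))
```

Because the masked logits underflow to zero probability after the softmax, each decoding step can emit at most one character, which is what makes counting characters during generation straightforward.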