Breeze-7B Technical Report

Chan-Jan Hsu, Chang-Le Liu, Feng-Ting Liao, Po-Chun Hsu, Yi-Chang Chen, Da-Shan Shiu

TL;DR

Breeze-7B introduces a Traditional Chinese–focused LLM built on Mistral-7B, with extensive data curation, a customized Chinese tokenizer, and long-context pretraining up to 32k tokens. It adds instruction finetuning to improve chat and Q&A capabilities, and evaluates on language comprehension, chatbot benchmarks, and long-context tasks, showing competitive performance among open-source 7B-class models. The work emphasizes data quality, efficiency gains from the extended tokenizer, and robust long-context behavior, while highlighting open-source releases Breeze-7B-Base and Breeze-7B-Instruct to spur community development in Traditional Chinese NLP. These contributions advance accessible, high-performing Traditional Chinese LLMs with practical implications for dialogue systems and document-level understanding.

Abstract

Breeze-7B is an open-source language model based on Mistral-7B, designed to address the need for improved language comprehension and chatbot-oriented capabilities in Traditional Chinese. This technical report provides an overview of the additional pretraining, finetuning, and evaluation stages for the Breeze-7B model. The Breeze-7B family of base and chat models performs well on language comprehension and chatbot-oriented tasks, reaching the top of several benchmarks among models in a comparable complexity class.

Paper Structure

This paper contains 19 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Perplexity (PPL) change during the additional pretraining stage of Breeze-7B, after the vocabulary size extension. The PPL scores are calculated using our proprietary Traditional Chinese validation dataset.
  • Figure 2: Passkey Retrieval results of Breeze-7B-Base and Breeze-7B-32k-Base. The y-axis denotes the input sequence length, while the x-axis denotes the depth of the key position in the example. Each length-depth combination is run 20 times, and the accuracy is color-coded according to the colormap at the bottom of the figure.
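
The length-depth grid described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' evaluation harness. The filler text, the prompt wording, and the names `build_prompt`, `evaluate_grid`, and `oracle` are all hypothetical. Length is counted here in filler sentences rather than tokens, and `oracle` is a stand-in for a real model call.

```python
import random

# Hypothetical filler; the paper's actual distractor text is not given here.
FILLER = "The grass is green. The sky is blue. The sun is yellow. "
MARKER = "The pass key is "


def build_prompt(n_filler, depth, passkey):
    """Hide a passkey sentence at a fractional depth (0.0 to 1.0) in filler text."""
    sentences = [FILLER] * n_filler
    insert_at = round(depth * n_filler)  # 0 = start of context, n_filler = end
    sentences.insert(insert_at, f"{MARKER}{passkey}. Remember it. ")
    return "".join(sentences) + "What is the pass key?"


def evaluate_grid(answer_fn, lengths, depths, trials=20, seed=0):
    """Return {(length, depth): accuracy}, with `trials` runs per grid cell."""
    rng = random.Random(seed)
    grid = {}
    for n in lengths:
        for d in depths:
            hits = 0
            for _ in range(trials):
                passkey = str(rng.randint(10000, 99999))  # fresh 5-digit key per trial
                if passkey in answer_fn(build_prompt(n, d, passkey)):
                    hits += 1
            grid[(n, d)] = hits / trials
    return grid


def oracle(prompt):
    """Stand-in for a model: reads the 5-digit key straight out of the prompt."""
    start = prompt.index(MARKER) + len(MARKER)
    return prompt[start:start + 5]


if __name__ == "__main__":
    results = evaluate_grid(oracle, lengths=[10, 100], depths=[0.0, 0.5, 1.0])
    for (length, depth), acc in sorted(results.items()):
        print(f"length={length:4d} depth={depth:.1f} accuracy={acc:.2f}")
```

Replacing `oracle` with a function that queries a language model, and measuring length in tokens with that model's tokenizer, would give one accuracy value per cell, which could then be drawn as a heatmap like the one the caption describes.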