Breeze-7B Technical Report
Chan-Jan Hsu, Chang-Le Liu, Feng-Ting Liao, Po-Chun Hsu, Yi-Chang Chen, Da-Shan Shiu
TL;DR
Breeze-7B introduces a Traditional Chinese–focused LLM built on Mistral-7B, with extensive data curation, a customized Chinese tokenizer, and long-context pretraining up to 32k tokens. It adds instruction finetuning to improve chat and Q&A capabilities, and is evaluated on language comprehension, chatbot benchmarks, and long-context tasks, showing competitive performance among open-source 7B-class models. The work emphasizes data quality, efficiency gains from the extended tokenizer, and robust long-context behavior, and it releases Breeze-7B-Base and Breeze-7B-Instruct as open-source models to spur community development in Traditional Chinese NLP. These contributions advance accessible, high-performing Traditional Chinese LLMs with practical implications for dialogue systems and document-level understanding.
Abstract
Breeze-7B is an open-source language model based on Mistral-7B, designed to address the need for improved language comprehension and chatbot-oriented capabilities in Traditional Chinese. This technical report provides an overview of the additional pretraining, finetuning, and evaluation stages for the Breeze-7B model. The Breeze-7B family of base and chat models performs well on language comprehension and chatbot-oriented tasks, ranking at the top of several benchmarks among models of comparable complexity.
