Breeze-7B Technical Report

Chan-Jan Hsu; Chang-Le Liu; Feng-Ting Liao; Po-Chun Hsu; Yi-Chang Chen; Da-Shan Shiu

TL;DR

Breeze-7B introduces a Traditional Chinese–focused LLM built on Mistral-7B, with extensive data curation, a customized Chinese tokenizer, and long-context pretraining up to $32k$ tokens. It adds instruction finetuning to improve chat and Q&A capabilities, and evaluates on language comprehension, chatbot benchmarks, and long-context tasks, showing competitive performance among open-source 7B-class models. The work emphasizes data quality, efficiency gains from the extended tokenizer, and robust long-context behavior, while highlighting open-source releases Breeze-7B-Base and Breeze-7B-Instruct to spur community development in Traditional Chinese NLP. These contributions advance accessible, high-performing Traditional Chinese LLMs with practical implications for dialogue systems and document-level understanding.

Abstract

Breeze-7B is an open-source language model based on Mistral-7B, designed to address the need for improved language comprehension and chatbot-oriented capabilities in Traditional Chinese. This technical report provides an overview of the additional pretraining, finetuning, and evaluation stages for the Breeze-7B model. The Breeze-7B family of base and chat models performs well on language comprehension and chatbot-oriented tasks, ranking first on several benchmarks among models of comparable complexity.

Paper Structure (19 sections, 2 figures, 6 tables)

Introduction
Method
Model architecture customization
Training
Long Context Pretraining
Instruction Finetuning
Benchmarks
Language comprehension benchmarks
Chatbot-oriented benchmarks
Long-context benchmarks
Results
Models evaluated
Results
Traditional Chinese Language Comprehension benchmarks
Traditional Chinese Chatbot-oriented benchmarks
...and 4 more sections

Figures (2)

Figure 1: Perplexity (PPL) change during the additional pretraining stage of Breeze-7B, after the vocabulary size extension. The PPL scores are calculated using our proprietary Traditional Chinese validation dataset.
Figure 2: Passkey Retrieval results of Breeze-7B-Base and Breeze-7B-32k-Base. The y-axis denotes the input sequence length, while the x-axis denotes the depth of the key position in the example. Each length-depth combination is trialed 20 times and the accuracy is color-coded with the colormap at the bottom.
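
The length-depth grid described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation harness: the filler text, the prompt wording, the use of character counts in place of token counts, and the `answer_fn` callable (a stand-in for a model call) are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical filler text; the report does not specify what was used.
FILLER = "The grass is green. The sky is blue. The sun is yellow. "


def build_prompt(n_chars: int, depth: float, passkey: str) -> str:
    """Insert a passkey sentence at a relative depth (0.0 to 1.0) in filler text."""
    body = (FILLER * (n_chars // len(FILLER) + 1))[:n_chars]
    cut = int(len(body) * depth)
    needle = f" The pass key is {passkey}. Remember it. "
    return body[:cut] + needle + body[cut:] + "\nWhat is the pass key?"


def passkey_accuracy(
    answer_fn: Callable[[str], str],
    lengths: Sequence[int],
    depths: Sequence[float],
    trials: int = 20,
    seed: int = 0,
) -> Dict[Tuple[int, float], float]:
    """Return the retrieval accuracy for every (length, depth) combination."""
    rng = random.Random(seed)
    grid: Dict[Tuple[int, float], float] = {}
    for n in lengths:
        for d in depths:
            hits = 0
            for _ in range(trials):
                key = str(rng.randint(10000, 99999))
                # A trial counts as correct if the passkey appears in the answer.
                hits += key in answer_fn(build_prompt(n, d, key))
            grid[(n, d)] = hits / trials
    return grid
```

Passing a real model's generate function as `answer_fn` would yield one accuracy value per grid cell, which is the quantity a heatmap like Figure 2 color-codes.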
