Table of Contents
Fetching ...

PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency

Preferred Elements, :, Kenshin Abe, Kaizaburo Chubachi, Yasuhiro Fujita, Yuta Hirokawa, Kentaro Imajo, Toshiki Kataoka, Hiroyoshi Komatsu, Hiroaki Mikami, Tsuguo Mogami, Shogo Murai, Kosuke Nakago, Daisuke Nishino, Toru Ogawa, Daisuke Okanohara, Yoshihiko Ozaki, Shotaro Sano, Shuji Suzuki, Tianqi Xu, Toshihiko Yanase

Abstract

We introduce PLaMo-100B, a large-scale language model designed for Japanese proficiency. The model was trained from scratch using 2 trillion tokens, with architecture such as QK Normalization and Z-Loss to ensure training stability during the training process. Post-training techniques, including Supervised Fine-Tuning and Direct Preference Optimization, were applied to refine the model's performance. Benchmark evaluations suggest that PLaMo-100B performs well, particularly in Japanese-specific tasks, achieving results that are competitive with frontier models like GPT-4. The base model is available at https://huggingface.co/pfnet/plamo-100b.

PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency

Abstract

We introduce PLaMo-100B, a large-scale language model designed for Japanese proficiency. The model was trained from scratch using 2 trillion tokens, with architecture such as QK Normalization and Z-Loss to ensure training stability during the training process. Post-training techniques, including Supervised Fine-Tuning and Direct Preference Optimization, were applied to refine the model's performance. Benchmark evaluations suggest that PLaMo-100B performs well, particularly in Japanese-specific tasks, achieving results that are competitive with frontier models like GPT-4. The base model is available at https://huggingface.co/pfnet/plamo-100b.

Paper Structure

This paper contains 32 sections, 1 equation, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Self-attention layers with QK normalization
  • Figure 2: Difference in the z-loss values between using bfloat16 and FP8 for the lm-head.
  • Figure 3: Summary diagram of the training pipeline.