LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

LLM-jp: Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, Hiroyuki Deguchi, Rintaro Enomoto, Kazuki Fujii, Kensuke Fukumoto, Takuya Fukushima, Namgi Han, Yuto Harada, Chikara Hashimoto, Tatsuya Hiraoka, Shohei Hisada, Sosuke Hosokawa, Lu Jie, Keisuke Kamata, Teruhito Kanazawa, Hiroki Kanezashi, Hiroshi Kataoka, Satoru Katsumata, Daisuke Kawahara, Seiya Kawano, Atsushi Keyaki, Keisuke Kiryu, Hirokazu Kiyomaru, Takashi Kodama, Takahiro Kubo, Yohei Kuga, Ryoma Kumon, Shuhei Kurita, Sadao Kurohashi, Conglong Li, Taiki Maekawa, Hiroshi Matsuda, Yusuke Miyao, Kentaro Mizuki, Sakae Mizuki, Yugo Murawaki, Akim Mousterou, Ryo Nakamura, Taishi Nakamura, Kouta Nakayama, Tomoka Nakazato, Takuro Niitsuma, Jiro Nishitoba, Yusuke Oda, Hayato Ogawa, Takumi Okamoto, Naoaki Okazaki, Yohei Oseki, Shintaro Ozaki, Koki Ryu, Rafal Rzepka, Keisuke Sakaguchi, Shota Sasaki, Satoshi Sekine, Kohei Suda, Saku Sugawara, Issa Sugiura, Hiroaki Sugiyama, Hisami Suzuki, Jun Suzuki, Toyotaro Suzumura, Kensuke Tachibana, Yu Takagi, Kyosuke Takami, Koichi Takeda, Masashi Takeshita, Masahiro Tanaka, Kenjiro Taura, Arseny Tolmachev, Nobuhiro Ueda, Zhen Wan, Shuntaro Yada, Sakiko Yahata, Yuya Yamamoto, Yusuke Yamauchi, Hitomi Yanaka, Rio Yokota, Koichiro Yoshino

TL;DR

LLM-jp addresses Japan's need for open, high-quality Japanese LLMs by organizing a large-scale, open collaboration across academia and industry. The project covers the full development pipeline, from corpus construction and tokenizer development to pre-training, fine-tuning, evaluation, and safety data creation, and has produced 13B-parameter model suites (v1.0 and v2.0) with publicly released corpora and tuning data. It is organized into working groups (WGs) for Corpus Building, Model Building, Fine-tuning and Evaluation, Computational Infrastructure, and Safety; it releases open tools and resources (llm-jp-eval, AnswerCarefully); and it engages in safety research (JBBQ, toxicity) and international collaboration. The v2.0 configurations show improved performance, but the initiative acknowledges remaining safety challenges. It plans larger-scale models (175B) and more diverse corpora to improve robustness and societal alignment, with the aim of establishing a national LLM R&D hub in Japan.

Abstract

This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.

Paper Structure (29 sections, 2 figures, 16 tables)