Ruri: Japanese General Text Embeddings

Hayato Tsukagoshi, Ryohei Sasano

TL;DR

This report provides a detailed account of the development of Ruri, a series of Japanese general text embedding models, including the training of embedding models using synthesized datasets generated by LLMs, the construction of the reranker for dataset filtering and knowledge distillation, and the performance evaluation of the resulting general-purpose text embedding models.

Abstract

We report the development of Ruri, a series of Japanese general text embedding models. While general-purpose text embedding models for English and multilingual settings have been actively developed in recent years, comparable work for Japanese remains limited, primarily because of a lack of datasets and of the necessary expertise. In this report, we describe the development process of Ruri in detail: the training of embedding models on synthesized datasets generated by LLMs, the construction of a reranker for dataset filtering and knowledge distillation, and the performance evaluation of the resulting general-purpose text embedding models.

Paper Structure (32 sections, 15 tables)