Domain Adaptation for Japanese Sentence Embeddings with Contrastive Learning based on Synthetic Sentence Generation
Zihao Chen, Hisashi Handa, Miho Ohsaki, Kimiaki Shirahama
TL;DR
This work tackles the scarcity of large, labeled data for domain-specific Japanese sentence embeddings by introducing SDJC, a self-supervised framework that generates domain-relevant hard negative sentences via a fine-tuned T5 data generator and trains embeddings through contrastive learning. A key insight is that nouns are the most impactful content words in Japanese for semantic similarity, which guides the generation of hard negatives. The authors construct a comprehensive Japanese STS benchmark (JSTS) by translating English STS datasets and combining them with existing Japanese datasets (JSICK and JGLUE), enabling robust evaluation of sentence-embedding methods in Japanese. Experiments on clinical STS and educational information retrieval show that SDJC improves domain-specific embeddings, and further gains are achieved with additional fine-tuning on in-domain or general-domain data, underlining the practical value of self-supervised domain adaptation for low-resource languages. The work also provides a public GitHub repository with datasets, code, and adapted backbones, facilitating broader adoption and benchmarking.
Abstract
Several backbone models pre-trained on general-domain datasets can encode a sentence into a widely useful embedding. Such sentence embeddings can be further enhanced by domain adaptation, which adapts a backbone model to a specific domain. However, domain adaptation for low-resource languages like Japanese is often difficult due to the scarcity of large-scale labeled datasets. To overcome this, this paper introduces SDJC (Self-supervised Domain adaptation for Japanese sentence embeddings with Contrastive learning), which uses a data generator to produce sentences that have the same syntactic structure as a sentence in an unlabeled domain-specific corpus but convey a different meaning. The generated sentences are then used to boost contrastive learning that adapts a backbone model to accurately discriminate sentences in the specific domain. In addition, the components of SDJC, such as the backbone model and the method used to adapt it, need to be carefully selected, but no benchmark dataset is available for Japanese. Thus, a comprehensive Japanese STS (Semantic Textual Similarity) benchmark dataset is constructed by combining datasets machine-translated from English with existing datasets. The experimental results validate the effectiveness of SDJC on two domain-specific downstream tasks as well as the usefulness of the constructed dataset. Datasets, code, and backbone models adapted by SDJC are available in our GitHub repository: https://github.com/ccilab-doshisha/SDJC.
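To make the role of the generated sentences concrete, the following is a minimal sketch of a contrastive objective with hard negatives. It is not taken from the SDJC repository: it assumes an InfoNCE-style loss over cosine similarities, and the function name, the temperature value, and the way positives are obtained are all illustrative assumptions. The embeddings are plain NumPy arrays standing in for the output of a backbone model.

```python
import numpy as np


def contrastive_loss_with_hard_negatives(anchors, positives, hard_negatives,
                                         temperature=0.05):
    """InfoNCE-style loss (illustrative, not the authors' code).

    Row i of `positives` is the positive for row i of `anchors`.
    Row i of `hard_negatives` is a generated sentence with similar
    structure but different meaning. All other positives and all hard
    negatives in the batch act as negatives for anchor i.
    """
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a = l2_normalize(np.asarray(anchors, dtype=float))
    p = l2_normalize(np.asarray(positives, dtype=float))
    n = l2_normalize(np.asarray(hard_negatives, dtype=float))

    # Cosine similarities of every anchor with every positive and
    # every hard negative, scaled by the temperature: shape (B, 2B).
    logits = np.concatenate([a @ p.T, a @ n.T], axis=1) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The correct "class" for anchor i is its own positive (column i).
    idx = np.arange(a.shape[0])
    return float(-log_prob[idx, idx].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    anchors = rng.normal(size=(4, 8))
    positives = anchors + 0.01 * rng.normal(size=(4, 8))
    easy = contrastive_loss_with_hard_negatives(anchors, positives, -anchors)
    hard = contrastive_loss_with_hard_negatives(anchors, positives, anchors.copy())
    print("loss with easy negatives:", easy)
    print("loss with hard negatives:", hard)
```

In this sketch, negatives that lie close to the anchor in embedding space produce a larger loss than distant ones, which is the intuition behind generating structurally similar but semantically different sentences: they give the backbone a stronger training signal in the target domain.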
